Provenance is information recording the source, derivation, or history of some information. 
Provenance tracking has been studied in a variety of settings, particularly database management 
systems; however, although many candidate definitions of provenance have been proposed, the 
mathematical or semantic foundations of data provenance have received comparatively little 
attention. In this article, we argue that dependency analysis techniques familiar from program 
analysis and program slicing provide a formal foundation for forms of provenance that are intended 
to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic 
characterization of such dependency provenance for a core database query language, show that 
minimal dependency provenance is not computable, and provide dynamic and static approximation 
techniques. We also discuss preliminary implementation experience with using dependency 
provenance to compute data slices, or summaries of the parts of the input relevant to a given part of 
the output. 



1. Introduction 



Provenance is information about the origin, ownership, influences upon, or other historical or 
contextual information about an object. Such information has many applications, including 
evaluating integrity or authenticity claims, detecting and repairing errors, and memoizing and 
caching the results of computations such as scientific workflows (Lynch, 2000; Bose and Frew, 2005; 
Simmhan et al., 2005). Provenance is particularly important in scientific computation and 
record-keeping, since it is considered essential for ensuring the repeatability of experiments and 
judging the scientific value of their results (Buneman et al., 2008). 



Most computer systems provide simple forms of provenance, such as the timestamp and ownership 
metadata in file systems, system logs, and version control systems. Richer provenance tracking 
techniques have been studied in a variety of settings, including databases (Cui et al., 2000; 
Buneman et al., 2001, 2008b; Foster et al., 2008; Green et al., 2007), file systems 
(Muniswamy-Reddy et al., 2006), and scientific workflows (Bose and Frew, 2005; Simmhan et al., 
2005). Although a wide variety of design points have been explored, there is relatively little 
understanding of the relationships among techniques or of the design considerations that should 
be taken into account when 
developing or evaluating an approach to provenance. The mathematical or semantic foundations 
of data provenance have received comparatively little attention. Most prior approaches have in- 
voked intuitive concepts such as contribution, influence, and relevance as motivation for their 
definitions of provenance. These intuitions suggest definitions that appear adequate for simple 
(e.g. conjunctive) relational queries, but are difficult to extend to handle more complex queries 
involving subqueries, negation, grouping, or aggregation. 

However, these intuitions have also motivated rigorous approaches to seemingly quite different 
problems, such as aiding debugging via program slicing (Biswas, 1997; Field and Tip, 1998; 
Weiser, 1981), supporting efficient memoization and caching (Abadi et al., 1996; Acar et al., 
2003), and improving program security using information flow analysis (Sabelfeld and Myers, 
2003). As Abadi et al. (1999) have argued, slicing, information flow, and several other program 
analysis techniques can all be understood in terms of dependence. In this article, we argue that 
these dependency analysis and slicing techniques familiar from programming languages provide 
a suitable foundation for an interesting class of provenance techniques. 

To illustrate our approach, consider the input data shown in Figure 1(a) and the SQL query in 
Figure 1(b), which calculates the average molecular weights of proteins involved in each reaction. 
The result of this query is shown in Figure 1(c). The intuitive meaning of the SQL query is to find 
all combinations of rows from the three tables Protein, EnzymaticReaction, and Reaction such 
that the conditions in the WHERE-clause hold, then group the results by the Name field, while 
averaging the MW (molecular weight) field values and returning them in the AvgMW field. 

Since the MW field contains the molecular weight of a protein, it is clearly an error for the 
italicized value in the result to be negative. To track down the source of the error, it would be 
helpful to know which parts of the input contributed to, or were relevant to, the erroneous part 
of the output. We can formalize this intuition by saying that a part of the output depends on a 
part of the input if a change to the input part may result in a change to the output part. This is 
analogous to the notion of dependence underlying program slicing (Weiser, 1981), a debugging 
aid that identifies the parts of a program on which a program output may depend. 

In this example, the input field values that the erroneous output AvgMW-value depends on are 
highlighted in bold. The dependencies include the two summed MW values and the ID fields 
which are compared by the selection and grouping query. These ID fields must be included be- 
cause a change to any one of them could result in a change to the italicized output value — for 
example, changing the occurrence of in table EnzymaticReaction would change the average 
molecular weight in the second row. On the other hand, the names of the proteins and reac- 
tions are irrelevant to the output AvgMW — no changes to these parts can have any effect on the 
italicized value, and so we can safely ignore these parts when looking for the source of the error. 
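
The change-based reading of dependence used above can be tried out directly on a toy instance. The following is a minimal sketch, not the technique developed in this article: the three tables, their contents, and the helper names `avg_mw_by_reaction` and `change_affects` are all hypothetical, chosen only to mirror the shape of the example query. A `True` result witnesses a dependence; a `False` result for one particular replacement value is inconclusive, since some other change might still affect the output.

```python
def avg_mw_by_reaction(protein, enz_reaction, reaction):
    """Join the three tables and average the MW field per reaction name."""
    groups = {}
    for p in protein:
        for er in enz_reaction:
            for r in reaction:
                if p["ID"] == er["ProteinID"] and er["ReactionID"] == r["ID"]:
                    groups.setdefault(r["Name"], []).append(p["MW"])
    return {name: sum(ws) / len(ws) for name, ws in groups.items()}


def change_affects(tables, table, row, field, new_value, out_key):
    """True if replacing one input field by new_value changes output[out_key]."""
    before = avg_mw_by_reaction(**tables).get(out_key)
    # Copy every row so the original tables are left untouched.
    changed = {name: [dict(r) for r in rows] for name, rows in tables.items()}
    changed[table][row][field] = new_value
    after = avg_mw_by_reaction(**changed).get(out_key)
    return before != after


# Hypothetical data: two proteins taking part in one reaction.
tables = {
    "protein": [
        {"ID": 1, "Name": "p1", "MW": 10.0},
        {"ID": 2, "Name": "p2", "MW": -20.0},
    ],
    "enz_reaction": [
        {"ProteinID": 1, "ReactionID": 7},
        {"ProteinID": 2, "ReactionID": 7},
    ],
    "reaction": [{"ID": 7, "Name": "r7"}],
}

mw_matters = change_affects(tables, "protein", 1, "MW", 20.0, "r7")
id_matters = change_affects(tables, "enz_reaction", 0, "ProteinID", 2, "r7")
name_matters = change_affects(tables, "protein", 0, "Name", "other", "r7")
```

In this toy instance the averaged MW value and the join key both affect the output, while renaming a protein does not, matching the informal discussion of the example.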

This example is simplistic, but the ability to concisely explain which parts of the input influ- 
ence each part of the output becomes more important if we consider a large query to a realistic 
database with tens or hundreds of columns per table and thousands or millions of rows. Manu- 
ally tracing the dependence information in such a setting would be prohibitively labor-intensive. 
Moreover, dependence information is useful for a variety of other applications, including 
estimating the freshness of data in a query result by aggregating timestamps on the relevant inputs, or 
transferring quality annotations provided by users from the outputs of a query back to the inputs. 
(b) 

SELECT R.Name AS Name, AVG(P.MW) AS AvgMW 
FROM Protein P, EnzymaticReaction ER, Reaction R 
WHERE P.ID = ER.ProteinID AND ER.ReactionID = R.ID 
GROUP BY R.Name 

(c) 

Name                        AvgMW 
t-p + ATP = td + ADP        15.75 
H2O + anap^p + ac           -338.2 
D-r-5-p = D-r-5-p           18.1 

Fig. 1. Example (a) input, (b) query, and (c) output data; input field names and values 
relevant to the italicized erroneous output field or value are highlighted in bold. 



For example, suppose that each part of the database is annotated with a timestamp. Given a 
query, the dependence information shown in Figure 1 can be used to estimate the last modification 
time of the data relevant to each part of the output, by summarizing the set of timestamps of parts 
of the input contributing to an output part. Similarly, suppose that the system provides users with 
the ability to provide quality feedback in the form of star ratings. If a user flags the negative 
AvgMW value as being of low quality, this feedback can be propagated back to the underlying 
data according to the dependence information shown in Figure 1 and provided to the database 
maintainers who may find it useful in finding and correcting the error. In many cases, the user 
who finds the error may also be the database maintainer, but dependency information still seems 
useful as a debugging (or data cleaning) tool even if one has direct access to the data. 
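
The timestamp application described above amounts to a small aggregation over dependence information. The sketch below assumes the dependence information is already available as a mapping from output parts to sets of input parts; the part identifiers, the timestamps, and the function name `freshness` are hypothetical and only illustrate the idea.

```python
from datetime import datetime


def freshness(dependencies, timestamps):
    """For each output part, return the newest timestamp among the input
    parts it depends on (None if it depends on no input part)."""
    result = {}
    for out_part, in_parts in dependencies.items():
        stamps = [timestamps[p] for p in in_parts]
        result[out_part] = max(stamps) if stamps else None
    return result


# Hypothetical last-modification times of some input fields.
timestamps = {
    "Protein[1].MW": datetime(2008, 1, 5),
    "Protein[2].MW": datetime(2008, 3, 9),
    "Reaction[2].ID": datetime(2007, 11, 2),
}

# Hypothetical dependence information for one output field.
dependencies = {
    "Result[2].AvgMW": {"Protein[1].MW", "Protein[2].MW", "Reaction[2].ID"},
}

latest = freshness(dependencies, timestamps)
```

Propagating a quality rating from an output part back to the inputs would use the same mapping in the other direction, attaching the rating to every input part in the set.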

In this article, we argue that data dependence provides a solid semantic foundation for a 
provenance technique that highlights parts of the input on which each part of the output depends. 
We work in the setting of the nested relational calculus (NRC) (Buneman et al., 1994, 1995; Wong, 
1996), a core language for database queries that is closely related to monad algebra (Wadler, 
1992). The NRC provides all of the expressiveness of popular query languages such as SQL, 
and includes collection types such as sets or multisets, equipped with union, comprehension, 
difference and equality operations. The NRC can also be extended to handle SQL's grouping and 
aggregation operations, and functions on basic types. We consider annotation-propagating se- 
mantics for such queries and define a property called dependency-correctness, which, intuitively, 
means that the provenance annotations produced by a query reflect how the output of the query 
may change if the input is changed. 

There may be many possible dependency-correct annotation-propagating queries correspond- 
ing to an ordinary query. In general, it is preferable to minimize the annotations on the result, 
since this provides more precise dependency information. Unfortunately, as we shall show, mini- 
mal annotations are not computable. Instead, therefore, we develop dynamic and static techniques 
that produce dependency-correct annotations that are not necessarily minimal. We have imple- 
mented these techniques and found that they yield reasonable results on small-scale examples; 
the implementation was used to generate the results shown in Figure 1. 



1.1. Prior Work on Provenance 

We first review the relevant previous work on provenance and contrast it with our approach. 
We provide a detailed comparison with prior work on program slicing and information flow in 
Section 6. 



1.1.1. Provenance for database queries Provenance for database queries has been studied by 
a number of researchers, beginning in the early 1990s (Buneman et al., 2001, 2002, 2008b; 
Cui et al., 2000; Green et al., 2007; Wang and Madnick, 1990; Woodruff and Stonebraker, 1997). 
Recent research on annotations, uncertainty, and incomplete information (Benjelloun et al., 2006; 
Bhagwat et al., 2005; Geerts et al., 2006) has also drawn on these approaches to provenance; in 
particular, definitions of provenance have been used to justify annotation-propagation behaviors 
in these systems. We will focus on the differences between our work and the most recent work; 
please see Buneman et al. (2008a) and Cheney et al. (2009) for more complete discussion of 
research on provenance in databases. 

Most prior work on provenance has focused on identifying information that explains why some 
data is present in the output of a query (or view) or where some data in the output was copied 
from in the input. However, satisfying semantic characterizations of these intuitions have proven 
elusive; indeed, many of the proposed definitions themselves have been unclear or ambiguous. 
Many proposed forms of provenance are sensitive to query rewriting, in that equivalent database 
queries may have different provenance behavior. This raises a number of troubling issues, since 
database systems typically rewrite queries modulo equivalence, so the provenance of a query 
may change as a result of query optimization. Also, in part because of the absence of clear 
formal definitions and foundations, these approaches have been difficult to generalize beyond 
monotone relational queries in a principled way. 



In why- and where-provenance, introduced by Buneman et al. (2001), provenance is studied 
in a deterministic tree data model, in which each part of the database can be addressed by a 
unique path. Buneman et al. (2001) considered two forms of provenance: why-provenance, which 
consists of a set of witnesses, or subtrees of the input that suffice to explain a part of the output, 
and where-provenance, which consists of a single part of the input from which a given part of 
the output was copied. Both forms of provenance are sensitive to query rewriting in general, but 
Buneman et al. (2001) discussed normal forms for queries that avoid this problem. 

Green et al. (2007) showed that relations with semiring-valued annotations on rows generalize 
several variations of the relational model, including set, bag, probabilistic, and incomplete 
information models, and identified a relationship between free semiring-valued relations and 
why-provenance. Foster et al. (2008) extended this approach to handle NRC queries and an unordered 
variant of XML. These approaches also appear orthogonal to our approach, and in addition only 
consider annotations at the level of elements of collections, not individual fields or collections, 
and they do not handle negation or aggregation. 



Buneman et al. (2008b) introduced a model of where-provenance for the nested relational 
calculus. In their approach each part of the database is tagged with an optional annotation, or 
color; colors are propagated to the output so as to indicate where parts of the output have been 
copied from in the input. They studied the expressiveness of this model compared to queries that 
explicitly manipulate annotations. They also investigated where-provenance for updates, which we 
discuss in Section 1.1.2. 

Our work is closest in spirit to the why-provenance and lineage techniques; however, in con- 
trast to these techniques our approach annotates every part of the database and provides clear 
semantic guarantees and qualitatively useful provenance information in the presence of nega- 
tion, grouping and aggregation. 



1.1.2. Provenance for database updates Some recent work has generalized where-provenance 
to database updates (Buneman et al., 2006, 2008b), motivated by curated scientific databases that 
are updated frequently, often by (manual) copying from other sources. This work has focused on 
recording the external sources of the data in a database and tracking how the data has been 
rearranged within a database across successive versions. Accordingly, the provenance information 
provided by these approaches only connects data to exact copies in other locations, and does not 
track provenance through other operations. In this sense, it is similar to the where-provenance 
approach considered by Buneman et al. (2001) for database queries. 

Our approach addresses an orthogonal issue, that of understanding how data in the result of 
a query depends on parts of the input; we therefore track provenance through copies as well as 
other forms of computation. Although our definition of dependency correctness could also be 
used for updates, it is not clear whether this yields a useful form of provenance, and we plan to 
investigate alternative dependency conditions that are more suitable for updates, using the update 
language employed by Buneman et al. (2008b). 



1.1.3. Workflow provenance Provenance has also been studied in geospatial and scientific 
computation (Bose and Frew, 2005; Foster and Moreau, 2006; Simmhan et al., 2005), particularly for 
workflows (visual programs written by scientists). In their simplest form (see e.g. the Provenance 
Challenge (Moreau et al., 2007)), workflows are essentially directed acyclic graphs (DAGs) 
representing a computation. For such DAG workflows, the provenance information that is typically 
stored is simply the workflow DAG, annotated with additional information, such as filenames 
and timestamps, describing the arguments that were used to compute the result of interest. This 
corresponds to a simple form of dependency tracking, although as far as we know no research on 
workflow provenance has explicitly drawn this connection. 

However, many more sophisticated workflow programming models have been developed, 
involving concurrency and distributed computation. For these models, the appropriate 
correctness criteria for provenance tracking are much less clear. In fact, the ordinary semantics of 
these systems is not always clearly specified. One principled approach recently introduced by 
Hidders et al. (2007) defines provenance for the nested relational calculus augmented with 
additional function symbols that represent calls to scientific workflow components. Their approach 
has so far focused on defining provenance and not formulating or proving desirable correctness 
properties. We believe dependence analysis may provide an appropriate foundation for prove- 
nance in this and other workflow programming models. 



1.2. Contributions 

The main contribution of this article is the development of a semantic criterion called dependency- 
correctness that characterizes a form of provenance information we call dependency provenance. 
Dependency-correctness captures an intuition that provenance should link a part of the output to 
all parts of the input on which the output part depends, enabling us to make some predictions 
about the effects of changes to the input and to quickly identify source data that contributed to 
an error in the output. 

Building on this framework, we show that (unsurprisingly) it is undecidable whether some 
dependency-correct provenance information is minimal, and proceed to develop computable dy- 
namic and static techniques for conservatively approximating correct dependency provenance. 
These techniques, and their correctness proofs, are largely standard but the presence of data- 
base query language features and collection types introduces complications that have not been 
addressed before in work on information flow or program slicing. 



1.3. Organization 

The structure of the rest of this article is as follows. We review the syntax, type system and 
semantics of the nested relational calculus in Section 2. We then introduce (in Section 3) the 
annotation-propagation model, motivate and define dependency-correctness, and show that it is 
impossible to compute minimal dependency-correct annotations. In Section 4 we describe a 
dynamic provenance-tracking semantics that is dependency-correct. We also (Section 5) introduce 
a static, type-based provenance analysis which is less accurate than provenance tracking, but can 
be performed statically; we also prove its correctness relative to dynamic provenance tracking. 
We discuss experience with a prototype implementation in Section 6 and discuss future work and 
conclude in Section 7. 



2. Background 



We will provide a brief review of the nested relational calculus (NRC) (Buneman et al., 1995), 
a core database query language which is closely related to monad algebra (Wadler, 1992). The 
nested relational calculus is a typed functional language with types $\tau$ of the form: 

\[ \tau ::= \mathsf{bool} \mid \mathsf{int} \mid \tau_1 \times \tau_2 \mid \{\tau\} \]

We consider base types bool and int, along with product types $\tau_1 \times \tau_2$ and 
collection types $\{\tau\}$. Collection types typically are taken to be monads equipped with an 
addition operator (sometimes called ringads); typical examples used in databases include lists, 
sets, or multisets, and in this article we consider finite multisets (also known as bags). 
The expressions of our variant of NRC are as follows: 

\[ \begin{array}{rcl}
e & ::= & x \mid \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 \mid (e_1, e_2) \mid \pi_1(e) \mid \pi_2(e) \\
  & \mid & b \mid \neg e \mid e_1 \wedge e_2 \mid e_1 \approx e_2 \mid \mathsf{if}\ e_0\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2 \\
  & \mid & i \mid e_1 + e_2 \mid \mathsf{sum}(e) \\
  & \mid & \emptyset \mid \{e\} \mid e_1 \cup e_2 \mid e_1 - e_2 \mid \{e_2 \mid x \in e_1\} \mid \bigcup e
\end{array} \]

Here, $i \in \mathbb{Z} = \{\ldots, -1, 0, 1, \ldots\}$ denotes integer constants and 
$b \in \mathbb{B} = \{\mathsf{true}, \mathsf{false}\}$ denotes Boolean constants. The bag 
operations include $\emptyset$, the constant empty multiset; singletons $\{e\}$; multiset union 
$\cup$, difference $-$, and comprehension $\{e_2 \mid x \in e_1\}$; and flattening $\bigcup e$. By 
convention, we write $\{e_1, \ldots, e_n\}$ as syntactic sugar for $\{e_1\} \cup \cdots \cup \{e_n\}$. 
Finally, we include sum, a typical aggregation operation, which adds together all of the elements 
of a multiset and produces a value; e.g. $\mathsf{sum}\{1, 2, 3\} = 6$. By convention, we take 
$\mathsf{sum}(\emptyset) = 0$. We syntactically distinguish between NRC's equality operation 
$\approx$ and mathematical equality $=$. 

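NRC expressions can be typechecked using standard techniques. Contexts $\Gamma$ are lists of pairs 
of variables and types $x_1 : \tau_1, \ldots, x_n : \tau_n$, where $x_1, \ldots, x_n$ are distinct. 
The rules for typechecking expressions are shown in Figure 2. 

We write $\mathcal{M}_{fin}(X)$ for the set of all finite multisets with elements drawn from $X$. 
The (standard) interpretation of base types as sets of values is as follows: 

\[ \begin{array}{rcl}
\mathcal{T}[\![\mathsf{bool}]\!] & = & \mathbb{B} = \{\mathsf{true}, \mathsf{false}\} \\
\mathcal{T}[\![\mathsf{int}]\!] & = & \mathbb{Z} = \{\ldots, -1, 0, 1, \ldots\} \\
\mathcal{T}[\![\tau_1 \times \tau_2]\!] & = & \mathcal{T}[\![\tau_1]\!] \times \mathcal{T}[\![\tau_2]\!] \\
\mathcal{T}[\![\{\tau\}]\!] & = & \mathcal{M}_{fin}(\mathcal{T}[\![\tau]\!])
\end{array} \]

An environment $\gamma$ is a function from variables to values. We define the set of environments 
matching context $\Gamma$ as 
$\mathcal{T}[\![\Gamma]\!] = \{\gamma \mid \forall x \in \mathrm{dom}(\Gamma).\ \gamma(x) \in \mathcal{T}[\![\Gamma(x)]\!]\}$. 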

Figure 3 gives the semantics of queries. Note that we overload notation for pair projection $\pi_i$ 
and bag operations such as $\cup$ and $\bigcup$; also, if $S$ is a bag of integers, then $\sum S$ 
is the sum of their values (taking $\sum \emptyset = 0$). It is straightforward to show that 

Lemma 2.1. If $\Gamma \vdash e : \tau$ then $\mathcal{E}[\![e]\!] : \mathcal{T}[\![\Gamma]\!] \to \mathcal{T}[\![\tau]\!]$. 
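
To make the expression forms concrete, the following is a small evaluator sketch for the NRC variant described above. It is an illustration under stated assumptions rather than the semantics of Figure 3 itself: the tuple encoding of expressions, the tag names such as `"sng"` and `"for"`, and the `Bag` class (a multiset kept in a canonical sorted order) are all choices made here, and no type checking is performed.

```python
from collections import Counter


class Bag(tuple):
    """A finite multiset, stored as a tuple in a canonical (sorted) order."""

    def __new__(cls, items=()):
        return super().__new__(cls, sorted(items, key=repr))


def evaluate(e, env):
    """Evaluate an expression (a tagged tuple) in environment env (a dict)."""
    op = e[0]
    if op == "var":
        return env[e[1]]
    if op == "const":                      # integer or Boolean constant
        return e[1]
    if op == "let":                        # ("let", x, e1, e2)
        return evaluate(e[3], {**env, e[1]: evaluate(e[2], env)})
    if op == "pair":
        return (evaluate(e[1], env), evaluate(e[2], env))
    if op == "proj":                       # ("proj", i, e) with i in {1, 2}
        return evaluate(e[2], env)[e[1] - 1]
    if op == "not":
        return not evaluate(e[1], env)
    if op == "and":
        return evaluate(e[1], env) and evaluate(e[2], env)
    if op == "eq":
        return evaluate(e[1], env) == evaluate(e[2], env)
    if op == "if":
        return evaluate(e[2] if evaluate(e[1], env) else e[3], env)
    if op == "plus":
        return evaluate(e[1], env) + evaluate(e[2], env)
    if op == "sum":                        # sum of the empty bag is 0
        return sum(evaluate(e[1], env))
    if op == "empty":
        return Bag()
    if op == "sng":
        return Bag([evaluate(e[1], env)])
    if op == "union":
        return Bag(evaluate(e[1], env) + evaluate(e[2], env))
    if op == "diff":                       # multiset difference
        left = Counter(evaluate(e[1], env))
        right = Counter(evaluate(e[2], env))
        return Bag((left - right).elements())
    if op == "for":                        # {e2 | x in e1} as ("for", x, e1, e2)
        return Bag(evaluate(e[3], {**env, e[1]: v}) for v in evaluate(e[2], env))
    if op == "flatten":
        return Bag(x for inner in evaluate(e[1], env) for x in inner)
    raise ValueError(f"unknown expression form: {op}")


def const(c):
    return ("const", c)


# The bag {1, 2, 3} written with singletons and unions.
bag123 = ("union", ("sng", const(1)), ("union", ("sng", const(2)), ("sng", const(3))))
total = evaluate(("sum", bag123), {})
shifted = evaluate(("for", "x", bag123, ("plus", ("var", "x"), const(1))), {})
```

Keeping bags in a canonical order makes multiset equality coincide with tuple equality, which keeps the `"eq"` case a one-liner.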



Remark 2.1. As discussed in previous work (Buneman et al., 1995), the NRC can express a 
wide variety of queries including ordinary relational queries, nested subqueries, and grouping 
and aggregation queries. The core NRC excludes a number of convenient features such as records 
with named fields and comprehensions. However, these features can be viewed as syntactic sugar 
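\[
\dfrac{x{:}\tau \in \Gamma}{\Gamma \vdash x : \tau}
\qquad
\dfrac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 : \tau_2}
\qquad
\dfrac{i \in \mathbb{Z}}{\Gamma \vdash i : \mathsf{int}}
\qquad
\dfrac{\Gamma \vdash e_1 : \mathsf{int} \quad \Gamma \vdash e_2 : \mathsf{int}}{\Gamma \vdash e_1 + e_2 : \mathsf{int}}
\]
\[
\dfrac{\Gamma \vdash e : \{\mathsf{int}\}}{\Gamma \vdash \mathsf{sum}(e) : \mathsf{int}}
\qquad
\dfrac{b \in \mathbb{B}}{\Gamma \vdash b : \mathsf{bool}}
\qquad
\dfrac{\Gamma \vdash e_0 : \mathsf{bool} \quad \Gamma \vdash e_1 : \tau \quad \Gamma \vdash e_2 : \tau}{\Gamma \vdash \mathsf{if}\ e_0\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2 : \tau}
\qquad
\dfrac{\Gamma \vdash e : \mathsf{bool}}{\Gamma \vdash \neg e : \mathsf{bool}}
\]
\[
\dfrac{\Gamma \vdash e_1 : \mathsf{bool} \quad \Gamma \vdash e_2 : \mathsf{bool}}{\Gamma \vdash e_1 \wedge e_2 : \mathsf{bool}}
\qquad
\dfrac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma \vdash e_2 : \tau_2}{\Gamma \vdash (e_1, e_2) : \tau_1 \times \tau_2}
\qquad
\dfrac{\Gamma \vdash e : \tau_1 \times \tau_2}{\Gamma \vdash \pi_i(e) : \tau_i}\ (i \in \{1, 2\})
\]
\[
\dfrac{\Gamma \vdash e_1 : \tau \quad \Gamma \vdash e_2 : \tau}{\Gamma \vdash e_1 \approx e_2 : \mathsf{bool}}
\qquad
\dfrac{}{\Gamma \vdash \emptyset : \{\tau\}}
\qquad
\dfrac{\Gamma \vdash e : \tau}{\Gamma \vdash \{e\} : \{\tau\}}
\qquad
\dfrac{\Gamma \vdash e_1 : \{\tau\} \quad \Gamma \vdash e_2 : \{\tau\}}{\Gamma \vdash e_1 \cup e_2 : \{\tau\}}
\]
\[
\dfrac{\Gamma \vdash e_1 : \{\tau\} \quad \Gamma \vdash e_2 : \{\tau\}}{\Gamma \vdash e_1 - e_2 : \{\tau\}}
\qquad
\dfrac{\Gamma \vdash e_1 : \{\tau_1\} \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \{e_2 \mid x \in e_1\} : \{\tau_2\}}
\qquad
\dfrac{\Gamma \vdash e : \{\{\tau\}\}}{\Gamma \vdash \bigcup e : \{\tau\}}
\]

Fig. 2. Well-formed query expressions 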
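\[ \begin{array}{rcl}
\mathcal{E}[\![e_1 \cup e_2]\!]\gamma & = & \mathcal{E}[\![e_1]\!]\gamma \cup \mathcal{E}[\![e_2]\!]\gamma \\
\mathcal{E}[\![e_1 - e_2]\!]\gamma & = & \mathcal{E}[\![e_1]\!]\gamma - \mathcal{E}[\![e_2]\!]\gamma \\
\mathcal{E}[\![\{e \mid x \in e_0\}]\!]\gamma & = & \{\mathcal{E}[\![e]\!]\gamma[x \mapsto v] \mid v \in \mathcal{E}[\![e_0]\!]\gamma\} \\
\mathcal{E}[\![\mathsf{if}\ e_0\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2]\!]\gamma & = &
  \begin{cases}
    \mathcal{E}[\![e_1]\!]\gamma & \text{if } \mathcal{E}[\![e_0]\!]\gamma = \mathsf{true} \\
    \mathcal{E}[\![e_2]\!]\gamma & \text{if } \mathcal{E}[\![e_0]\!]\gamma = \mathsf{false}
  \end{cases} \\
\mathcal{E}[\![e_1 \approx e_2]\!]\gamma & = &
  \begin{cases}
    \mathsf{true} & \text{if } \mathcal{E}[\![e_1]\!]\gamma = \mathcal{E}[\![e_2]\!]\gamma \\
    \mathsf{false} & \text{if } \mathcal{E}[\![e_1]\!]\gamma \neq \mathcal{E}[\![e_2]\!]\gamma
  \end{cases}
\end{array} \]

Fig. 3. Semantics of query expressions 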


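\[ \begin{array}{ll}
\pi_A(R) & \{x.A \mid x \in R\} \\
\sigma_{A=B}(R) & \bigcup\{\mathsf{if}\ x.A = x.B\ \mathsf{then}\ \{x\}\ \mathsf{else}\ \emptyset \mid x \in R\} \\
R \times S & \{(A : x.A, B : x.B, C : y.C, D : y.D, E : y.E) \mid x \in R, y \in S\} \\
\pi_{BE}(\sigma_{A=D}(R \times S)) & \bigcup\{\mathsf{if}\ x.A = y.D\ \mathsf{then}\ \{(B : x.B, E : y.E)\}\ \mathsf{else}\ \emptyset \mid x \in R, y \in S\} \\
R \cup \rho_{A/C,B/D}(\pi_{CD}(S)) & R \cup \{(A : y.C, B : y.D) \mid y \in S\} \\
R - \rho_{A/D,B/E}(\pi_{DE}(S)) & R - \{(A : y.D, B : y.E) \mid y \in S\} \\
\mathsf{sum}(\pi_A(R)) & \mathsf{sum}\{x.A \mid x \in R\} \\
\mathsf{count}(R) & \mathsf{sum}\{1 \mid x \in R\}
\end{array} \]

Fig. 4. Example queries 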
for core NRC expressions. In particular, we use abbreviations such as: 

\[ \begin{array}{rcl}
\{e \mid x_1 \in e_1, x_2 \in e_2\} & = & \bigcup\{\{e \mid x_2 \in e_2\} \mid x_1 \in e_1\} \\
\{e \mid x \in e_0, C\} & = & \bigcup\{\mathsf{if}\ C\ \mathsf{then}\ \{e\}\ \mathsf{else}\ \emptyset \mid x \in e_0\} \\
\{e \mid (x_1, x_2) \in e_0\} & = & \{\mathsf{let}\ x_1 = \pi_1(x), x_2 = \pi_2(x)\ \mathsf{in}\ e \mid x \in e_0\}
\end{array} \]

Additional base types, primitive functions and relations such as real, 
$/ : \mathsf{real} \times \mathsf{real} \to \mathsf{real}$, and 
$\mathsf{average} : \{\mathsf{real}\} \to \mathsf{real}$ can also be added without difficulty. For 
example, using more readable named records, comprehensions, and pattern-matching, the SQL query 
from Figure 1(b) can be defined as 

\[ \begin{array}{l}
\mathsf{let}\ X = \{(r.Name, p.MW) \mid r \in R, er \in ER, p \in P, er.RID = r.ID, p.ID = er.PID\}\ \mathsf{in} \\
\{(n, \mathsf{average}\{mw \mid (n', mw) \in X, n = n'\}) \mid (n, \_) \in X\}
\end{array} \]

Additional examples are shown in Figure 4. 

We do not consider other features of SQL such as operator overloading or incomplete infor- 
mation (NULL values). 
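
The comprehension abbreviations used in this section can be sanity-checked on small bags. The sketch below models bags as Python lists compared through `collections.Counter`, so that order is ignored but multiplicity is kept; the sample bags `R` and `S` and the helper `flatten` are assumptions made only for this illustration.

```python
from collections import Counter


def flatten(bags):
    """Bag flattening: the multiset union of a bag of bags."""
    return [x for inner in bags for x in inner]


R = [1, 2, 2]
S = [10, 20]

# {(x, y) | x in R, y in S}  versus  flatten{ {(x, y) | y in S} | x in R }
two_generators = [(x, y) for x in R for y in S]
nested = flatten([[(x, y) for y in S] for x in R])

# {x | x in R, x = 2}  versus  flatten{ if x = 2 then {x} else {} | x in R }
filtered = [x for x in R if x == 2]
conditional = flatten([[x] if x == 2 else [] for x in R])

same_product = Counter(two_generators) == Counter(nested)
same_filter = Counter(filtered) == Counter(conditional)
```

Both comparisons hold on this input, which is what the desugaring into flattening and conditionals predicts; the check is of course only an example, not a proof of the equations.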



3. Annotations, Provenance and Dependence 



We wish to define dependency provenance as information relating each part of the output of a 
query to a set of parts of the input on which the output part depends. Collection types such as sets 
and bags are unordered and lack a natural way to address parts of values, so we must introduce 
one. One technique (familiar from many program analyses (Nielson et al., 2005) as well as other 
work on provenance (Buneman et al., 2008b; Wang and Madnick, 1990)) is to enrich the data 
model with annotations that can be used to refer to parts of the value. In practice, the annotations 
might consist of explicit paths or addresses pointing into a particular representation of a part of 
the data, analogous to filenames and line number references in compiler error messages, but for 
our purposes, it is preferable to leave the structure of annotations abstract; we therefore consider 
annotations to be sets of colors, or elements of some abstract data type Color. 

We can then infer provenance information from functions on annotated values by observing 
how such functions propagate annotations; conversely, we can define provenance-tracking se- 
mantics by enriching ordinary functions with annotation-propagation behavior. However, for any 
ordinary function, there are many corresponding annotation-propagating functions so the ques- 
tion arises of how to choose among them. 

We consider two natural constraints on the annotated functions we will consider. First, if we 
ignore annotations, the behavior of an annotated function should correspond to that of an ordinary 
function. Second, the behavior of the annotated functions should treat the annotations abstractly, 
so that we may view the colors as locations. We show that both properties follow from a single 
condition called color-invariance. 

We next define dependency-correctness, a property characterizing annotated functions whose 
annotations safely over-approximate the dependency behavior of some ordinary function. Such 
annotations can be used to compute a natural notion of "data slices", by highlighting those parts 
of the input on which a given part of the output may depend. It is clearly desirable to produce 
slices that are as small as possible. Unfortunately, minimal slices turn out not to be computable 
since it is undecidable whether the annotations in the output of a dependency-correct function 
are minimal. In the next sections, we will show how to calculate approximate dynamic and static 
dependency information for NRC queries. 

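We define annotated values (a-values) $v$, raw values (r-values) $w$, and multisets of annotated 
values $V$ as follows: 

\[ v ::= w^{\Phi} \qquad w ::= i \mid b \mid (v_1, v_2) \mid V \qquad V ::= \{v_1, \ldots, v_n\} \]

Annotations are sets $\Phi \subseteq \mathsf{Color}$ of values from some atomic data type Color, 
called colors. We often omit set brackets in the annotations, for example writing $w^{a,b,c}$ 
instead of $w^{\{a,b,c\}}$ and $w$ instead of $w^{\emptyset}$. An a-value $v$ is said to be 
distinctly colored if every part of it is colored with a singleton set $\{a\}$ and no color $c$ is 
used more than once in $v$. 

For each type $\tau$, we define the set $\mathcal{A}[\![\tau]\!]$ of annotated values of type 
$\tau$ as follows: 

\[ \begin{array}{rcl}
\mathcal{A}[\![\mathsf{bool}]\!] & = & \{b^{\Phi} \mid b \in \mathbb{B}\} \\
\mathcal{A}[\![\mathsf{int}]\!] & = & \{i^{\Phi} \mid i \in \mathbb{Z}\} \\
\mathcal{A}[\![\tau_1 \times \tau_2]\!] & = & \{(v_1, v_2)^{\Phi} \mid v_1 \in \mathcal{A}[\![\tau_1]\!], v_2 \in \mathcal{A}[\![\tau_2]\!]\} \\
\mathcal{A}[\![\{\tau\}]\!] & = & \{V^{\Phi} \mid \forall v \in V.\ v \in \mathcal{A}[\![\tau]\!]\}
\end{array} \]

Annotated environments $\gamma$ map variables to annotated values. We define the set of annotated 
environments matching context $\Gamma$ as 
$\mathcal{A}[\![\Gamma]\!] = \{\gamma \mid \forall x \in \mathrm{dom}(\Gamma).\ \gamma(x) \in \mathcal{A}[\![\Gamma(x)]\!]\}$. 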

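We define an erasure function $|-|$, mapping a-values to ordinary values (and, abusing notation, 
also mapping r-values to ordinary values), and an annotation extraction function $\|-\|$ which 
extracts the set of all colors mentioned anywhere in an a-value or r-value, as follows: 

\[ \begin{array}{rclcrcl}
|i| & = & i & \qquad & \|i\| & = & \emptyset \\
|b| & = & b & & \|b\| & = & \emptyset \\
|(v_1, v_2)| & = & (|v_1|, |v_2|) & & \|(v_1, v_2)\| & = & \|v_1\| \cup \|v_2\| \\
|V| & = & \{|v| \mid v \in V\} & & \|V\| & = & \bigcup\{\|v\| \mid v \in V\} \\
|w^{\Phi}| & = & |w| & & \|w^{\Phi}\| & = & \Phi \cup \|w\|
\end{array} \]

Two a-values are said to be compatible (written $v \cong v'$) if $|v| = |v'|$; also, an a-value $v$ 
is said to enrich an ordinary value $v'$ (written $v \triangleright v'$) provided $|v| = v'$. 

We now consider annotated functions (a-functions) 
$F : \mathcal{A}[\![\tau]\!] \to \mathcal{A}[\![\tau']\!]$ on a-values. We say that a-function $F$ 
enriches an ordinary function $f : \mathcal{T}[\![\tau]\!] \to \mathcal{T}[\![\tau']\!]$ (written 
$F \triangleright f$), provided that $\forall v \in \mathcal{A}[\![\tau]\!].\ f(|v|) = |F(v)|$. We 
will also consider annotated functions $F : \mathcal{A}[\![\Gamma]\!] \to \mathcal{A}[\![\tau]\!]$ 
mapping annotated environments to values. We say that an a-function $F$ enriches an ordinary 
function $f : \mathcal{T}[\![\Gamma]\!] \to \mathcal{T}[\![\tau]\!]$ (again written 
$F \triangleright f$), provided that $\forall \gamma \in \mathcal{A}[\![\Gamma]\!].\ f(|\gamma|) = |F(\gamma)|$. 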
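
The erasure and color-extraction functions have a direct recursive reading. The sketch below assumes one possible encoding of a-values, which is not prescribed by the article: an a-value is a Python pair `(raw, colors)`, where `colors` is a frozenset, a raw pair is a 2-tuple of a-values, and a raw bag is a list of a-values. The names `erase`, `colors`, and `ann` are introduced here for illustration.

```python
def erase(v):
    """|v|: drop every annotation, keeping only the ordinary value."""
    raw, _phi = v
    if isinstance(raw, tuple):             # pair of a-values
        return (erase(raw[0]), erase(raw[1]))
    if isinstance(raw, list):              # bag of a-values, in canonical order
        return sorted((erase(x) for x in raw), key=repr)
    return raw                             # integer or Boolean constant


def colors(v):
    """||v||: every color mentioned anywhere in the a-value."""
    raw, phi = v
    found = set(phi)
    if isinstance(raw, (tuple, list)):
        for part in raw:
            found |= colors(part)
    return found


def ann(raw, *cs):
    """Convenience constructor: attach the colors cs to raw."""
    return (raw, frozenset(cs))


# The a-value {(1^a, 1^b)^c}^d
v = ann([ann((ann(1, "a"), ann(1, "b")), "c")], "d")

plain = erase(v)
support = colors(v)
```

With this encoding, checking that an a-function `F` enriches an ordinary function `f` on a given input `v` is the single comparison `f(erase(v)) == erase(F(v))`.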



3.1. Color-invariance 

Clearly, many exotic a-functions exist that are not enrichments of any ordinary function. For 
example, consider 



\[ F(i^{\Phi}) = \begin{cases} 1^{\Phi} & (\Phi = \emptyset) \\ \cdots & (\Phi \neq \emptyset) \end{cases} \]

Here, $F$ tests whether its annotation is empty or not, and there is no ordinary function 
$f : \mathcal{T}[\![\mathsf{int}]\!] \to \mathcal{T}[\![\mathsf{int}]\!]$ such that 
$\forall v.\ |F(v)| = f(|v|)$. In the rest of this article we will restrict attention to 
a-functions $F$ that are enrichments of ordinary functions. 

In fact, we will restrict attention still further to a-functions whose behavior on colors is also 
abstract enough to be consistent with an interpretation of colors in the input as addresses for parts 
of the input. For example, consider 
$G, H : \mathcal{A}[\![\mathsf{int}]\!] \to \mathcal{A}[\![\mathsf{int}]\!]$ having the following 
behavior: 

\[ G(i^{\Phi}) = \begin{cases} \cdots & (\Phi = \emptyset) \\ \cdots & (\Phi \neq \emptyset) \end{cases} \qquad H(i^{\Phi}) = \cdots \]

where in both cases $c$ is some fixed color. Both functions are enrichments of the ordinary identity 
function on integers, but both perform nontrivial computations on the annotations. If we 
wish to interpret the colors on these functions as representing sets of locations, then we want to 
exclude from consideration functions like $G$ whose behavior depends on the size of the 
annotation set or functions like $H$ whose behavior depends on a specific color. 



By analogy with generic queries in relational databases (Abiteboul et al., 1995), such a-functions 
ought to behave in a way that is insensitive to the particular choice of colors. Moreover, since 
a-values are annotated by sets of colors, the a-functions also ought to be insensitive to properties 
of the annotations such as nonemptiness or equality. In particular, we expect that the behavior of 
an a-function is determined by its behavior on distinctly-colored inputs. 

To make this precise, we first need to define some auxiliary concepts. 

Definition 3.1. An a-value $v$ is distinctly-colored provided that for every subexpression 
$w^{\Phi}$ of $v$ we have $\Phi = \{c\}$ for some color $c$, and no two subexpressions occurring in 
$v$ have the same color. 

Example 3.1. For example, $v = \{(1^a, 1^b)^c\}^d$ is distinctly-colored, while 
$v' = \{(1^a, 1^a)^c\}^d$ is not, because the color $a$ is re-used in two different subexpression 
occurrences of $1^a$. 

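A color substitution is a function $\alpha : \mathsf{Color} \to \{\mathsf{Color}\}$ mapping colors 
to sets of colors. We can lift color substitutions to act on arbitrary a-values as follows: 

\[ \begin{array}{rcl}
\alpha(b) & = & b \\
\alpha(i) & = & i \\
\alpha((v_1, v_2)) & = & (\alpha(v_1), \alpha(v_2)) \\
\alpha(V) & = & \{\alpha(v) \mid v \in V\} \\
\alpha(w^{\Phi}) & = & \alpha(w)^{\alpha[\Phi]}
\end{array} \]

where $\alpha[\Phi] = \bigcup\{\alpha(c) \mid c \in \Phi\}$. Note that for any 
$v \in \mathcal{A}[\![\tau]\!]$, we have $\alpha(v) \in \mathcal{A}[\![\tau]\!]$; we 
sometimes write $\alpha^{\tau}$ to indicate the restriction of $\alpha$ to $\mathcal{A}[\![\tau]\!]$. 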

Example 3.2. Continuing the previous example, consider the color substitution defined by 
$\alpha(a) = \{b, c\}$ and $\alpha(x) = \{x\}$ for $x \neq a$. Applying this substitution to $v$ 
yields $\{(1^{b,c}, 1^b)^c\}^d$. Applying $\alpha$ to $\{1^a, 2^{a,d}\}^c$ yields 
$\{1^{b,c}, 2^{b,c,d}\}^c$. 
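
Lifting a color substitution to a-values is a straightforward structural recursion. The sketch below reuses an assumed encoding that the article does not prescribe: an a-value is a pair `(raw, colors)` with `colors` a frozenset, raw pairs are 2-tuples of a-values, and raw bags are lists of a-values. The names `substitute`, `ann`, and `alpha` are illustrative.

```python
def substitute(alpha, v):
    """Apply the color substitution alpha (color -> set of colors) to v."""
    raw, phi = v
    # alpha[Phi]: the union of alpha(c) over all colors c in Phi.
    new_phi = frozenset().union(*(alpha(c) for c in phi))
    if isinstance(raw, tuple):             # pair of a-values
        new_raw = tuple(substitute(alpha, part) for part in raw)
    elif isinstance(raw, list):            # bag of a-values
        new_raw = [substitute(alpha, part) for part in raw]
    else:                                  # integer or Boolean constant
        new_raw = raw
    return (new_raw, new_phi)


def ann(raw, *cs):
    """Convenience constructor: attach the colors cs to raw."""
    return (raw, frozenset(cs))


def alpha(color):
    """The substitution with alpha(a) = {b, c} and alpha(x) = {x} otherwise."""
    return {"b", "c"} if color == "a" else {color}


# v = {(1^a, 1^b)^c}^d, the distinctly-colored value from Example 3.1
v = ann([ann((ann(1, "a"), ann(1, "b")), "c")], "d")
result = substitute(alpha, v)
expected = ann([ann((ann(1, "b", "c"), ann(1, "b")), "c")], "d")
```

Only the annotations change under `substitute`; the underlying ordinary value is untouched, which is the content of the first part of Lemma 3.1.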

We note some useful properties relating distinctly-colored values, color-substitution and the 
erasure and color-support functions; these are easy to prove by induction. 
Lemma 3.1. Suppose $\alpha : \mathsf{Color} \to \{\mathsf{Color}\}$. Then (1) $|\alpha(v)| = |v|$ 
and (2) $\|\alpha(v)\| = \alpha[\|v\|]$. 

Lemma 3.2. Suppose $v$ is an a-value. Then there exists a distinctly-colored $v_0 \cong v$ and a 
color substitution $\alpha_0$ such that $\alpha_0(v_0) = v$. 

Accordingly, for each ordinary value $v$ fix a distinctly-annotated $\mathrm{dc}(v)$; moreover, for 
each a-value $v$ fix a color substitution $\alpha_v$ such that $\alpha_v(\mathrm{dc}(|v|)) = v$. 

We now define a property called color-invariance, by analogy with the color-propagation
studied in (Buneman et al., 2008b) for annotations consisting of single colors. Color-invariance
is defined as follows.

Definition 3.2 (Color-invariance). An a-function $F : \mathcal{A}[\![\tau_1]\!] \to \mathcal{A}[\![\tau_2]\!]$ is called color-invariant
if whenever $\alpha : \mathrm{color} \to \{\mathrm{color}\}$ then we have $\alpha^{\tau_2}(F(v)) = F(\alpha^{\tau_1}(v))$.

As noted above, color-invariance has two important consequences. First, the behavior of a 
color-invariant function is determined by its behavior on distinctly-colored inputs. Second, color- 
invariant functions are always enrichments of ordinary functions. 

Proposition 3.1. If $F, G : \mathcal{A}[\![\tau_1]\!] \to \mathcal{A}[\![\tau_2]\!]$ are color-invariant then the following are equivalent:

1 $F = G$

2 $F(v) = G(v)$ for every distinctly-colored $v \in \mathcal{A}[\![\tau_1]\!]$.

3 $F(dc(v)) = G(dc(v))$ for every ordinary value $v \in \mathcal{T}[\![\tau_1]\!]$.

Proof. The implications (1) $\Rightarrow$ (2) $\Rightarrow$ (3) are trivial. We show (3) implies (1). Let $v \in \mathcal{A}[\![\tau_1]\!]$
be given. Then $v = \alpha_v(dc(|v|))$, so to prove $F = G$, it suffices to show:

\[
F(v) = F(\alpha_v(dc(|v|))) = \alpha_v(F(dc(|v|))) = \alpha_v(G(dc(|v|))) = G(\alpha_v(dc(|v|))) = G(v)
\]

□

Proposition 3.2. If $F : \mathcal{A}[\![\tau_1]\!] \to \mathcal{A}[\![\tau_2]\!]$ is color-invariant then $F > f$ where $f(v) = |F(dc(v))|$.

Proof. Let $v \in \mathcal{A}[\![\tau_1]\!]$ be given. Then to prove $F > f$, observe:

\[
f(|v|) = |F(dc(|v|))| = |\alpha_v(F(dc(|v|)))| = |F(\alpha_v(dc(|v|)))| = |F(v)|
\]

□

We write $|F|$ for $f$ provided $F > f$; clearly, $f$ is unique when it exists, and $|F|$ exists for any
color-invariant $F$.



3.2. Dependency-correctness 

We now turn to the problem of characterizing a-functions whose annotation behavior captures a 
form of dependency information. Intuitively, an a-function F is dependency-correct if its output 
annotations tell us how changes to parts of the input may affect parts of the output. First, we need 
to capture the intuitive notion of changing a specific part of a value. 

Definition 3.3 (Equal except at c). Two a-values $v_1, v_2$ are equal except at $c$ ($v_1 \equiv_c v_2$) provided
that they have the same structure except possibly at subterms labeled with the color $c$; this
relation is defined as follows:

\[
\frac{d \in \mathbb{B} \cup \mathbb{Z}}{d \equiv_c d}
\qquad
\frac{v_1 \equiv_c v_1' \quad v_2 \equiv_c v_2'}{(v_1, v_2) \equiv_c (v_1', v_2')}
\qquad
\frac{v_1 \equiv_c v_1' \;\cdots\; v_n \equiv_c v_n'}{\{v_1, \ldots, v_n\} \equiv_c \{v_1', \ldots, v_n'\}}
\]
\[
\frac{w_1 \equiv_c w_2}{w_1^{\Phi} \equiv_c w_2^{\Phi}}
\qquad
\frac{c \in \Phi_1 \cap \Phi_2}{w_1^{\Phi_1} \equiv_c w_2^{\Phi_2}}
\]

Furthermore, we say that two annotated environments $\gamma, \gamma'$ are equal except at $a$ (written $\gamma \equiv_a \gamma'$)
if their domains are compatible ($\mathrm{dom}(\gamma) = \mathrm{dom}(\gamma')$) and they are pointwise equal except at
$a$, that is, for each $x \in \mathrm{dom}(\gamma)$, we have $\gamma(x) \equiv_a \gamma'(x)$.

Remark 3.1. For distinctly-colored values, a color serves as an address uniquely identifying a
subterm. Thus, $\equiv_c$ relates a distinctly-colored value to a value which can be obtained by
modifying the subterm located at $c$; that is, if we write $v_1$ as $C[v_1']$ where $C$ is a context and $v_1'$ is
the subterm labeled with $c$ in $v_1$, and $v_1 \equiv_c v_2$, then $v_2 = C[v_2']$ for some subterm $v_2'$ labeled
with $c$. Note that $v_1$ and $v_2$ need not be distinctly colored, and that $\equiv_c$ makes sense for arbitrary
a-values, not just distinctly colored ones.
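To illustrate the relation, here is a minimal sketch of a checker for "equal except at c". It assumes an illustrative encoding that is not part of the paper: an a-value is a pair `(raw, annotation)` with `raw` an int, a bool, a 2-tuple of a-values, or a list of a-values standing for a multiset. Multisets are matched up to reordering by a simple backtracking search, which is one reading of the rule for collections.

```python
def av(raw, *cols):
    """Build an a-value: raw data plus a set of colors (illustrative encoding)."""
    return (raw, frozenset(cols))

def equal_except(c, v1, v2):
    """Decide whether v1 and v2 are equal except at color c."""
    (r1, a1), (r2, a2) = v1, v2
    if c in a1 and c in a2:       # both subterms are labeled with c: anything goes
        return True
    if a1 != a2:                  # otherwise the annotations must coincide
        return False
    if isinstance(r1, tuple) and isinstance(r2, tuple):
        return all(equal_except(c, x, y) for x, y in zip(r1, r2))
    if isinstance(r1, list) and isinstance(r2, list):
        return _match(c, r1, r2)
    return type(r1) is type(r2) and r1 == r2

def _match(c, xs, ys):
    """Find a one-to-one matching of multiset elements related by equal_except."""
    if not xs:
        return not ys
    return any(
        equal_except(c, xs[0], y) and _match(c, xs[1:], ys[:i] + ys[i + 1:])
        for i, y in enumerate(ys)
    )

v1 = av([av((av(1, "a1"), av(3, "b1")), "r1")], "s")
v2 = av([av((av(2, "a1"), av(3, "b1")), "r1")], "s")
```

Here `v1` and `v2` differ only in the field labeled `a1`, so they are related at `a1` and at the colors of the enclosing record and set, but not at `b1`.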

Example 3.3. Consider the two a-environments: 

7 = (R:{(rS3^%5^-^)^\...r,S:-..) 
7' = (R:{(2^^3^^5^■^)''^...}^S:•••) 

We have 7 7', 7 =bi 7', and 7 7', assuming that the elided portions are identical. 

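Definition 3.4 (Dependency-correctness). An a-function $F : \mathcal{A}[\![\Gamma]\!] \to \mathcal{A}[\![\tau]\!]$ is dependency-correct
if for any $c \in \mathrm{Color}$ and $\gamma, \gamma' \in \mathcal{A}[\![\Gamma]\!]$ satisfying $\gamma \equiv_c \gamma'$, we have $F(\gamma) \equiv_c F(\gamma')$.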

Example 3.4. Recall 7, 7' as in the previous example. Suppose F is dependency-correct and 

Since 7 =cj^ 7', we know that ^(7) F{^') so we can see that F{Y) must be of the form 

for some x & Z. We do not necessarily know that x must be 2; this is not captured by dependency- 
correctness. 

Remark 3.2. Dependency-correctness tells us that for any $c$, we must have $F(\gamma) = C[v_1, \ldots, v_n]$
and $F(\gamma') = C[v_1', \ldots, v_n']$, where $C[-, \ldots, -]$ is a context not mentioning $c$ and $v_1, \ldots, v_n$,
$v_1', \ldots, v_n'$ are labeled with $c$. Thus, $F$'s annotations tell us which parts of the output (i.e.,
$v_1, \ldots, v_n$) may change if the input is changed at $c$. Dually, they also tell us what part of the
output (i.e., $C[-, \ldots, -]$) cannot be changed by changing the input at $c$.

We can consider the parts of the output labeled with $c$ to be a forward slice of the input at $c$;
it shows all of the parts of the output that may depend on $c$. Conversely, suppose the output is of
the form $C'[w^\Phi]$. Then we can define a backward slice corresponding to this part of the output
by factoring $\gamma$ into $C[v_1, \ldots, v_n]$ where $C$ is as small as possible subject to the constraint that
$\Phi \cap \|v_i\| = \emptyset$ for each $i$. This context $C[-, \ldots, -]$ identifies all of the parts of the input on which
a given part of the output may depend.

Of course, dependency-correctness does not uniquely characterize the annotation behavior
of a given F. It is possible for the annotations to be dependency-correct but inaccurate. For
example, we can always trivially annotate each part of the output with every color appearing in
the input. This, of course, tells us nothing about the function's behavior. In general, the fewer
the annotations present in the output of a dependency-correct F, the more they tell us about F's
behavior. We therefore consider a function F to be minimally annotated if no annotations can be
removed from F's output for any v without damaging correctness.



Example 3.5. For example, consider the ordinary function

\[
f(x, y) = \begin{cases} y & : x = 0 \\ x + 1 & : x \neq 0 \end{cases}
\]

Then the function

\[
F(x^a, y^b) = \begin{cases} y^{a,b} & : x = 0 \\ (x + 1)^{a,b} & : x \neq 0 \end{cases}
\]

is dependency-correct: trivially so, since it always propagates all annotations from the input to
each part of the output. Conversely,

\[
G(x^a, y^b) = \begin{cases} y^{a,b} & : x = 0 \\ (x + 1)^{a} & : x \neq 0 \end{cases}
\]

is dependency-correct and minimally annotated. To see that $G$ is dependency-correct, note that
if we evaluate $G(x, y)$ on $x \neq 0$ then changing only the value of $y$ (annotated by $b$) can never
change the result. To see that $G$ is minimally annotated, it suffices to check that removing any of
the annotations breaks dependency-correctness.
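The claims in this example can be checked by brute force on a small domain. The sketch below is illustrative only: annotated integers are encoded as pairs `(n, colors)`, the inputs are assumed to be colored `a` and `b` as in the example, and the names `eq_except`, `dependency_correct`, `G_without_b`, `G_without_a_then` and `G_without_a_else` are hypothetical. A finite search of this kind can refute dependency-correctness, but passing it is only evidence, not a proof.

```python
from itertools import product

A, B = frozenset("a"), frozenset("b")

def eq_except(c, v1, v2):
    """Equal-except-at-c for annotated integers (n, colors)."""
    (n1, a1), (n2, a2) = v1, v2
    return (c in a1 and c in a2) or (n1 == n2 and a1 == a2)

def F(x, y):
    """Propagates every input color to the result."""
    (xv, xa), (yv, ya) = x, y
    return (yv if xv == 0 else xv + 1, xa | ya)

def G(x, y):
    """Drops the color of y when y is not used."""
    (xv, xa), (yv, ya) = x, y
    return (yv, xa | ya) if xv == 0 else (xv + 1, xa)

def G_without_b(x, y):            # G with b removed from the first branch
    (xv, xa), (yv, ya) = x, y
    return (yv, xa) if xv == 0 else (xv + 1, xa)

def G_without_a_then(x, y):       # G with a removed from the first branch
    (xv, xa), (yv, ya) = x, y
    return (yv, ya) if xv == 0 else (xv + 1, xa)

def G_without_a_else(x, y):       # G with a removed from the second branch
    (xv, xa), (yv, ya) = x, y
    return (yv, xa | ya) if xv == 0 else (xv + 1, frozenset())

def dependency_correct(fn, dom=range(-2, 3)):
    """Change one input at a time and compare outputs at the changed color."""
    for x, y, z in product(dom, repeat=3):
        if not eq_except("a", fn((x, A), (y, B)), fn((z, A), (y, B))):
            return False
        if not eq_except("b", fn((x, A), (y, B)), fn((x, A), (z, B))):
            return False
    return True
```

On this domain `F` and `G` pass the check, while each of the three variants obtained by deleting one annotation from `G` fails it.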

We say that a query $e$ is constant if $\mathcal{E}[\![e]\!]\gamma = v$ for some $v$ and every suitable $\gamma$. Clearly, a query
is constant if and only if it has a dependency-correct enrichment which annotates each part of the
result with $\emptyset$.

Proposition 3.3. It is undecidable whether a Boolean NRC query is constant.

Proof. Recall that query equivalence is undecidable for the relational calculus (Abiteboul et al.,
1995); for NRC, equivalence is undecidable for queries $e(x), e'(x)$ over a single variable $x$.
Given two such queries, consider the expression $\hat{e} = (e(x) \approx e'(x)) \vee y$ (definable as $\neg(\neg(e(x) \approx
e'(x)) \wedge \neg y)$), where $y$ is a fresh variable distinct from $x$. The result of this expression cannot be
false everywhere since the disjunction is true for $y = \mathsf{true}$, so $\hat{e}$ is constant iff $\mathcal{E}[\![\hat{e}]\!]\gamma = \mathsf{true}$ for
every $\gamma$ iff $e \equiv e'$. □

Clearly, an annotation is needed on the result of a Boolean query if and only if the query is not
a constant, so finding minimal annotations (or minimal slices) is undecidable. As a result, we
cannot expect to be able to compute minimal dependency-correct annotations. It is important to
note, though, that dependency-tracking remains hard even if we consider sublanguages for which
equivalence is decidable. For example, if we just consider Boolean expressions, finding minimal
correct dependency information is also intractable, by an easy reduction from the validity
problem. These observations motivate considering approximation techniques, such as those in the
next two sections.
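For Boolean expressions the minimal top-level annotation can still be found by exhaustive search, which makes the cost visible. The following sketch assumes each input variable carries its own color, identified here by its position; the function names are hypothetical. It tries all 2^n assignments for each variable, so it is only usable for very small n.

```python
from itertools import product

def depends_on(f, n, i):
    """True if flipping input i changes f for some assignment of the n inputs."""
    for bits in product([False, True], repeat=n):
        flipped = list(bits)
        flipped[i] = not flipped[i]
        if f(*bits) != f(*flipped):
            return True
    return False

def minimal_annotation(f, n):
    """Set of input positions the result of f genuinely depends on."""
    return {i for i in range(n) if depends_on(f, n, i)}
```

For instance, `(x and y) or (x and not y)` mentions both variables but depends only on `x`, and `(x or not x) or y` is constant, so its minimal annotation is empty.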

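Remark 3.3 (Uniqueness of minimum annotations). We have shown that annotation minimality
is undecidable. However, there is another interesting question that we have not answered:
specifically, given a function $f$, is there a unique minimally-annotated function $F > f$? To show
this, one strategy could be to define a meet (greatest lower bound) operation $v \sqcap w$ on compatible
annotated values, lift this to compatible annotated functions $(F \sqcap G)(x) = F(x) \sqcap G(x)$, show
that dependency-correctness is preserved by $\sqcap$, and show that the greatest lower bound of the set
of all dependency-correct enrichments of $f$ exists and is dependency-correct.

However, actually defining the meet operation on values that preserves dependency-correctness
appears nontrivial. For example, it does not work to simply define the meet as the pointwise
intersection of corresponding annotations. Indeed, this is not even well-defined since the "pointwise
intersection" of $\{1^a, 1^b\}$ with itself could either be $\{1^a, 1^b\}$ or $\{1^\emptyset, 1^\emptyset\}$. We therefore leave the
uniqueness of minimally annotated functions as a conjecture.

\[
\begin{aligned}
(w^{\Phi})^{+\Psi} &= w^{\Phi \cup \Psi} \\
i_1^{\Phi_1} \mathbin{\hat{+}} i_2^{\Phi_2} &= (i_1 + i_2)^{\Phi_1 \cup \Phi_2} \\
\hat{\Sigma}\,\{v_1, \ldots, v_n\}^{\Phi} &= (v_1 \mathbin{\hat{+}} \cdots \mathbin{\hat{+}} v_n)^{+\Phi} \\
\hat{\neg}\, b^{\Phi} &= (\neg b)^{\Phi} \\
b_1^{\Phi_1} \mathbin{\hat{\wedge}} b_2^{\Phi_2} &= (b_1 \wedge b_2)^{\Phi_1 \cup \Phi_2} \\
\hat{\pi}_i((v_1, v_2)^{\Phi}) &= v_i^{+\Phi} \\
\mathrm{cond}(\mathsf{true}^{\Phi}, v_1, v_2) &= v_1^{+\Phi} \\
\mathrm{cond}(\mathsf{false}^{\Phi}, v_1, v_2) &= v_2^{+\Phi} \\
w_1^{\Phi_1} \mathbin{\hat{\cup}} w_2^{\Phi_2} &= (w_1 \cup w_2)^{\Phi_1 \cup \Phi_2} \\
\hat{\bigcup}\,\{v_1, \ldots, v_n\}^{\Phi} &= (v_1 \mathbin{\hat{\cup}} \cdots \mathbin{\hat{\cup}} v_n)^{+\Phi} \\
\{v(x) \mid x \in w^{\Phi}\} &= \{v(x) \mid x \in w\}^{\Phi} \\
w_1^{\Phi_1} \mathbin{\hat{-}} w_2^{\Phi_2} &= \{v \in w_1 \mid |v| \notin |w_2|\}^{\|w_1^{\Phi_1}\| \cup \|w_2^{\Phi_2}\|} \\
v_1 \mathbin{\hat{\approx}} v_2 &= \begin{cases} \mathsf{true}^{\|v_1\| \cup \|v_2\|} & |v_1| = |v_2| \\ \mathsf{false}^{\|v_1\| \cup \|v_2\|} & |v_1| \neq |v_2| \end{cases}
\end{aligned}
\]

Fig. 5. Auxiliary annotation-propagating operations

\[
\begin{aligned}
\mathcal{P}[\![x]\!]\gamma &= \gamma(x) \\
\mathcal{P}[\![\mathsf{let}\ x = e_1\ \mathsf{in}\ e_2]\!]\gamma &= \mathcal{P}[\![e_2]\!](\gamma[x \mapsto \mathcal{P}[\![e_1]\!]\gamma]) \\
\mathcal{P}[\![i]\!]\gamma &= i^{\emptyset} \\
\mathcal{P}[\![e_1 + e_2]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma) \mathbin{\hat{+}} (\mathcal{P}[\![e_2]\!]\gamma) \\
\mathcal{P}[\![\mathsf{sum}(e)]\!]\gamma &= \hat{\Sigma}(\mathcal{P}[\![e]\!]\gamma) \\
\mathcal{P}[\![b]\!]\gamma &= b^{\emptyset} \\
\mathcal{P}[\![\neg e]\!]\gamma &= \hat{\neg}(\mathcal{P}[\![e]\!]\gamma) \\
\mathcal{P}[\![e_1 \wedge e_2]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma) \mathbin{\hat{\wedge}} (\mathcal{P}[\![e_2]\!]\gamma) \\
\mathcal{P}[\![(e_1, e_2)]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma, \mathcal{P}[\![e_2]\!]\gamma)^{\emptyset} \\
\mathcal{P}[\![\pi_i(e)]\!]\gamma &= \hat{\pi}_i(\mathcal{P}[\![e]\!]\gamma) \quad (i \in \{1, 2\}) \\
\mathcal{P}[\![\emptyset]\!]\gamma &= \emptyset^{\emptyset} \\
\mathcal{P}[\![\{e\}]\!]\gamma &= \{\mathcal{P}[\![e]\!]\gamma\}^{\emptyset} \\
\mathcal{P}[\![e_1 \cup e_2]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma) \mathbin{\hat{\cup}} (\mathcal{P}[\![e_2]\!]\gamma) \\
\mathcal{P}[\![e_1 - e_2]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma) \mathbin{\hat{-}} (\mathcal{P}[\![e_2]\!]\gamma) \\
\mathcal{P}[\![\textstyle\bigcup e]\!]\gamma &= \hat{\bigcup}\, \mathcal{P}[\![e]\!]\gamma \\
\mathcal{P}[\![\{e \mid x \in e_0\}]\!]\gamma &= \{\mathcal{P}[\![e]\!](\gamma[x \mapsto v]) \mid v \in \mathcal{P}[\![e_0]\!]\gamma\} \\
\mathcal{P}[\![e_1 \approx e_2]\!]\gamma &= (\mathcal{P}[\![e_1]\!]\gamma) \mathbin{\hat{\approx}} (\mathcal{P}[\![e_2]\!]\gamma) \\
\mathcal{P}[\![\mathsf{if}\ e_0\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2]\!]\gamma &= \mathrm{cond}(\mathcal{P}[\![e_0]\!]\gamma, \mathcal{P}[\![e_1]\!]\gamma, \mathcal{P}[\![e_2]\!]\gamma)
\end{aligned}
\]

Fig. 6. Provenance-tracking semantics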



4. Dynamic Provenance Tracking 

We now consider a provenance tracking approach in which we interpret each expression $e$ as
a dependency-correct a-function $\mathcal{P}[\![e]\!]$. The definition of the provenance-tracking semantics is
shown in Figure 6. Auxiliary operations are used to define $\mathcal{P}[\![-]\!]$; these are shown in Figure 5.
In particular, note that we define an auxiliary operation $(w^{\Phi})^{+\Psi} = w^{\Phi \cup \Psi}$ that adds $\Psi$ to the
top-level annotation of an a-value $w^{\Phi}$.

Many cases involving ordinary programming constructs are self-explanatory. Constants always
have empty annotations: nothing in the input can affect them. Built-in functions such as $+$, $\wedge$, $\neg$
propagate all annotations on their arguments to the result. For a conditional if $e_0$ then $e_1$ else $e_2$,
the result is obtained by evaluating $e_1$ or $e_2$, and combining the top-level annotation of the result
with that of $e_0$. A constructed pair has an empty top-level annotation; in a projection, the top-level
annotation of the pair is merged with that of the returned value.

In the case for let-binding, note that we bind $x$ to the annotated result of evaluating $e_1$, and then
evaluate $e_2$. It is possible for the dependencies involved in constructing $x$ to not be propagated to
the result, if $x$ does not happen to be used in evaluating $e_2$. This is safe because query expressions
involve neither stateful side-effects nor nontermination, in contrast to most work on information
flow and slicing in general-purpose languages. Similarly, dependencies can be discarded in pair
projection expressions $\pi_i(e)$ and set comprehensions $\{e_2 \mid x \in e_1\}$, and again this is safe because
queries are purely functional and terminating.

The cases involving collection types deserve further explanation. The empty set is a constant,
so has an empty top-level annotation. Similarly, a singleton set constructor has an empty
annotation. For union, we take the union of the underlying bags (of annotated values) and fuse the
top-level annotations. For comprehension, we leave the top-level annotation alone. For flattening
$\bigcup e$, we take the lifted union ($\hat{\cup}$) of the elements of $e$ and add the top-level annotation of $e$.
Similarly, $\mathsf{sum}(e)$ uses $\hat{+}$ to add together the elements of $e$, fusing their annotations with that of $e$.
For set difference, to ensure dependency correctness, we must conservatively include all of the
colors present on either side in the annotation of the top-level expression. Similarly, for equality
tests, we must include all of the colors present in either value in the result annotation.
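The behavior of the annotation-propagating operations described above can be sketched in a few lines. Everything about the encoding is an assumption made for illustration: an a-value is a frozen dataclass `AV(raw, ann)`, pairs are tagged tuples `("pair", v1, v2)`, and bags are tagged tuples `("bag", (v1, ..., vn))` kept in insertion order. The function names (`plus`, `total`, `proj`, `cond`, `union`, `flatten`, `diff`, `eq`) are hypothetical stand-ins for the lifted operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AV:
    raw: object                  # int, bool, ("pair", AV, AV) or ("bag", (AV, ...))
    ann: frozenset = frozenset()

def av(raw, *cols):
    return AV(raw, frozenset(cols))

def bag(*items):
    return ("bag", tuple(items))

def add_ann(v, phi):
    """Add the colors phi to the top-level annotation of v."""
    return AV(v.raw, v.ann | frozenset(phi))

def erase(v):
    """Drop all annotations; erased bags are sorted so they compare as multisets."""
    r = v.raw
    if isinstance(r, tuple) and r[0] == "pair":
        return ("pair", erase(r[1]), erase(r[2]))
    if isinstance(r, tuple) and r[0] == "bag":
        return ("bag", tuple(sorted((erase(x) for x in r[1]), key=repr)))
    return r

def colors(v):
    """All colors occurring anywhere in v."""
    r, out = v.raw, set(v.ann)
    if isinstance(r, tuple):
        for x in (r[1:] if r[0] == "pair" else r[1]):
            out |= colors(x)
    return frozenset(out)

def plus(v1, v2):
    return AV(v1.raw + v2.raw, v1.ann | v2.ann)

def total(v):
    """Lifted sum: add the elements, then add the annotation of the bag itself."""
    out = AV(0)
    for x in v.raw[1]:
        out = plus(out, x)
    return add_ann(out, v.ann)

def proj(i, v):
    """Projection (i is 1 or 2): merge the pair's annotation into the component."""
    return add_ann(v.raw[i], v.ann)

def cond(b, v1, v2):
    return add_ann(v1 if b.raw else v2, b.ann)

def union(v1, v2):
    return AV(("bag", v1.raw[1] + v2.raw[1]), v1.ann | v2.ann)

def flatten(v):
    out = AV(bag())
    for x in v.raw[1]:
        out = union(out, x)
    return add_ann(out, v.ann)

def diff(v1, v2):
    """Keep elements of v1 whose erasure is absent from v2; annotate with all colors."""
    gone = [erase(x) for x in v2.raw[1]]
    kept = tuple(x for x in v1.raw[1] if erase(x) not in gone)
    return AV(("bag", kept), colors(v1) | colors(v2))

def eq(v1, v2):
    return AV(erase(v1) == erase(v2), colors(v1) | colors(v2))
```

Note how `diff` and `eq` collect every color of both operands, whereas `plus`, `proj` and `cond` only touch top-level annotations.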

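Note that equivalent expressions $e \equiv e'$ need not satisfy $\mathcal{P}[\![e]\!] = \mathcal{P}[\![e']\!]$; for example, $x - x \equiv \emptyset$
but $\mathcal{P}[\![x - x]\!] \neq \mathcal{P}[\![\emptyset]\!]$ since if $\gamma(x) = \emptyset^{a}$ then $\mathcal{P}[\![x - x]\!]\gamma = \emptyset^{a}$.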

Example 4.1. Consider an annotated input environment $\gamma$, shown in Figure 7(a), of schema
$R : \{(A : \mathsf{int}, B : \mathsf{int})\}$, $S : \{(C : \mathsf{int}, D : \mathsf{int}, E : \mathsf{int})\}$ (we again use named-record syntax for
readability). Figure 7(b) shows the provenance tracking semantics of the example queries
introduced earlier. We write $a_{123}$ as an abbreviation for the set $\{a_1, a_2, a_3\}$, etc. Note that in the count
example query, the output depends only on the number of rows in the input and not on the field
values; we cannot change the number of elements of a multiset by changing field values.

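Example 4.2 (Grouping and aggregation). Consider a query that performs grouping and
aggregation, such as

SELECT A, SUM(B) FROM R GROUP BY A

First, let

\[
X = \{(A : x.A,\ B : \{y.B \mid y \in R,\ x.A = y.A\}) \mid x \in R\}
\]

When run against the environment $\gamma$ in Figure 7(a), we obtain result

\[
X = \{(A : 1^{a_1}, B : \{1^{b_1}, 2^{b_2}\}^{a_{123}}),\ (A : 1^{a_2}, B : \{1^{b_1}, 2^{b_2}\}^{a_{123}}),\ (A : 2^{a_3}, B : \{3^{b_3}\}^{a_{123}})\}
\]

Note that since we consider collections to be multisets, we get two copies of $(1, \{1, 2\})$, one
corresponding to $a_1$ and one corresponding to $a_2$. Also, since the subqueries computing the
$B$-values inspect the $A$-values, each of the groups depends on each of the $A$-values. We can obtain
the final result of aggregation by evaluating

\[
\begin{aligned}
Y &= \{(A : x.A,\ B : \mathsf{sum}(x.B)) \mid x \in X\} \\
  &= \{(A : 1^{a_1}, B : 3^{a_{123}, b_{12}}),\ (A : 1^{a_2}, B : 3^{a_{123}, b_{12}}),\ (A : 2^{a_3}, B : 3^{a_{123}, b_3})\}
\end{aligned}
\]

(a)

\[
\begin{aligned}
\gamma = [\,& R := \{(A : 1^{a_1}, B : 1^{b_1}),\ (A : 1^{a_2}, B : 2^{b_2}),\ (A : 2^{a_3}, B : 3^{b_3})\}, \\
& S := \{(C : 1^{c_1}, D : 2^{d_1}, E : 3^{e_1}),\ (C : 1^{c_2}, D : 1^{d_2}, E : 4^{e_2})\}\,]
\end{aligned}
\]

(b)

\[
\begin{aligned}
\mathcal{P}[\![\Pi_A(R)]\!]\gamma &= \{(A : 1^{a_1}),\ (A : 1^{a_2}),\ (A : 2^{a_3})\} \\
\mathcal{P}[\![R \times S]\!]\gamma &= \{(A : 1^{a_1}, B : 1^{b_1}, C : 1^{c_1}, D : 2^{d_1}, E : 3^{e_1}),\ (A : 1^{a_1}, B : 1^{b_1}, C : 1^{c_2}, D : 1^{d_2}, E : 4^{e_2}),\ \ldots\} \\
\mathcal{P}[\![\Pi_{BE}(\sigma_{A=D}(R \times S))]\!]\gamma &= \{(B : 1^{b_1}, E : 4^{e_2}),\ (B : 2^{b_2}, E : 4^{e_2}),\ (B : 3^{b_3}, E : 3^{e_1})\}^{a_{123}, d_{12}} \\
\mathcal{P}[\![R \cup \rho_{A/C,B/D}(\Pi_{CD}(S))]\!]\gamma &= \{(A : 1^{a_1}, B : 1^{b_1}),\ (A : 1^{a_2}, B : 2^{b_2}),\ (A : 2^{a_3}, B : 3^{b_3}),\ (A : 1^{c_1}, B : 2^{d_1}),\ (A : 1^{c_2}, B : 1^{d_2})\} \\
\mathcal{P}[\![R - \rho_{A/D,B/E}(\Pi_{DE}(S))]\!]\gamma &= \{(A : 1^{a_1}, B : 1^{b_1}),\ (A : 1^{a_2}, B : 2^{b_2})\}^{a_{123}, b_{123}, d_{12}, e_{12}} \\
\mathcal{P}[\![\mathsf{sum}(\Pi_A(R))]\!]\gamma &= 4^{a_1, a_2, a_3} \\
\mathcal{P}[\![\mathsf{count}(R)]\!]\gamma &= 3 \\
\mathcal{P}[\![\mathsf{count}(\sigma_{A=B}(R))]\!]\gamma &= 1^{a_{123}, b_{123}}
\end{aligned}
\]

Fig. 7. (a) Annotated input environment (b) Examples of provenance tracking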

Remark 4.1. Our approach to handling negation and equality may result in large annotations in
some cases. For example, consider $\{1^a, 2^b\}^c - \{1^d, 3^e\}^f$. Changing any of the input locations
$a, b, c, d, e, f$ can cause the output to change. For example, changing $1^a$ to $4^a$ yields result $\{4, 2\}$,
while changing $2^b$ to $3^b$ yields result $\emptyset$. Thus, we must include all of the colors in the input in the
annotation of the top-level of the result set, since the size of the set can be affected by changes to
any of these parts.
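The effect described in this remark can be reproduced with a minimal sketch of the difference operation on bags of annotated integers. The encoding is hypothetical: a bag is a Python list of `(int, set_of_colors)` pairs plus a separate top-level annotation, and the operands are assumed to be a bag containing 1 and 2 (colors a, b, top-level c) and a bag containing 1 and 3 (colors d, e, top-level f), which is consistent with the results quoted in the remark.

```python
def diff(w1, phi1, w2, phi2):
    """Annotated difference: keep elements of w1 whose value is absent from w2.
    The result annotation collects every color occurring in either operand."""
    removed = {n for n, _ in w2}
    kept = [(n, a) for n, a in w1 if n not in removed]
    ann = set(phi1) | set(phi2)
    for _, a in w1 + w2:
        ann |= a
    return kept, ann

left = [(1, {"a"}), (2, {"b"})]
right = [(1, {"d"}), (3, {"e"})]
result, ann = diff(left, {"c"}, right, {"f"})

# Changing the element colored a from 1 to 4 makes both left elements survive.
changed_a, _ = diff([(4, {"a"}), (2, {"b"})], {"c"}, right, {"f"})
# Changing the element colored b from 2 to 3 makes the result empty.
changed_b, _ = diff([(1, {"a"}), (3, {"b"})], {"c"}, right, {"f"})
```

The top-level annotation of `result` contains all six colors, matching the observation that a change at any of them can alter the size of the output.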



Most previous techniques have not attempted to deal with negation. One exception is Cui et al.
(2000)'s definition of lineage. In their approach, the lineage of tuple $t \in R - S$ would be the tuple
$t \in R$ and all tuples of $S$. While this is more concise in some cases, it is not dependency-correct
by our definition.

On the other hand, our approach can also be more concise than lineage in the presence of
negation, because lineage only deals with annotations at the level of records. For example, in
$\{1\} - \{\pi_1(x) \mid x \in S\}$, our approach will indicate that the output does not depend on the
second components of elements of $S$, whereas the lineage of each tuple in the result of this query
includes all the records in $S$. This can make a big difference if there are many fields that are never
referenced; indeed, some scientific databases have tens or hundreds of fields per record, only a
few of which are needed for most queries.

Thus, although our approach to negation does exhibit pathological behavior in some cases, it
also provides more useful provenance for other typical queries. In any case, all other approaches
either ignore negation or also have some pathological behavior. Developing more sophisticated
forms of dependence that are better-behaved in the presence of negation is an interesting area for
future work.

4.1. Correctness of dynamic tracking 

In this section, we prove two correctness properties of dynamic tracking. First, we show that if
$\Gamma \vdash e : \tau$ then $\mathcal{P}[\![e]\!] : \mathcal{A}[\![\Gamma]\!] \to \mathcal{A}[\![\tau]\!]$ and $\mathcal{P}[\![e]\!] > \mathcal{E}[\![e]\!]$, that is, the provenance semantics respects
the typing and the ordinary semantics of $e$. Second, and more importantly, we show that $\mathcal{P}[\![e]\!]$
is dependency-correct. We first establish useful auxiliary properties of the annotation-merging
operation $v^{+\Phi}$ and prove that the lifted operations such as $\hat{+}$ have appropriate types and enrich
the corresponding ordinary operations.

Lemma 4.1. Let $v$ be an a-value and $\Phi$ an annotation. Then (1) $|v^{+\Phi}| = |v|$ and (2) $\|v^{+\Phi}\| = \|v\| \cup \Phi$.

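Lemma 4.2. In the following, assume that $v, v_1, v_2$ are in the domains of the appropriate functions.

1 $\hat{+} : \mathcal{A}[\![\mathsf{int}]\!] \times \mathcal{A}[\![\mathsf{int}]\!] \to \mathcal{A}[\![\mathsf{int}]\!]$ is color-invariant and $|v_1 \mathbin{\hat{+}} v_2| = |v_1| + |v_2|$.

2 $\hat{\Sigma} : \mathcal{A}[\![\{\mathsf{int}\}]\!] \to \mathcal{A}[\![\mathsf{int}]\!]$ is color-invariant and $|\hat{\Sigma} v| = \Sigma |v|$.

3 $\hat{\neg} : \mathcal{A}[\![\mathsf{bool}]\!] \to \mathcal{A}[\![\mathsf{bool}]\!]$ is color-invariant and $|\hat{\neg} v| = \neg |v|$.

4 $\hat{\wedge} : \mathcal{A}[\![\mathsf{bool}]\!] \times \mathcal{A}[\![\mathsf{bool}]\!] \to \mathcal{A}[\![\mathsf{bool}]\!]$ is color-invariant and $|v_1 \mathbin{\hat{\wedge}} v_2| = |v_1| \wedge |v_2|$.

5 For any $\tau_1, \tau_2$ and $i \in \{1, 2\}$ we have $\hat{\pi}_i : \mathcal{A}[\![\tau_1 \times \tau_2]\!] \to \mathcal{A}[\![\tau_i]\!]$ is color-invariant and
$|\hat{\pi}_i(v)| = \pi_i(|v|)$.

6 For any $\tau$, we have $\hat{\approx} : \mathcal{A}[\![\tau]\!] \times \mathcal{A}[\![\tau]\!] \to \mathcal{A}[\![\mathsf{bool}]\!]$ is color-invariant and $|v_1 \mathbin{\hat{\approx}} v_2| = (|v_1| \approx |v_2|)$.

7 For any $\tau$, we have $\mathrm{cond} : \mathcal{A}[\![\mathsf{bool}]\!] \times \mathcal{A}[\![\tau]\!] \times \mathcal{A}[\![\tau]\!] \to \mathcal{A}[\![\tau]\!]$ is color-invariant and $|\mathrm{cond}(v, v_1, v_2)| =$
if $|v|$ then $|v_1|$ else $|v_2|$.

8 For any $\tau$, we have $\hat{\cup} : \mathcal{A}[\![\{\tau\}]\!] \times \mathcal{A}[\![\{\tau\}]\!] \to \mathcal{A}[\![\{\tau\}]\!]$ is color-invariant and $|v_1 \mathbin{\hat{\cup}} v_2| = |v_1| \cup |v_2|$.

9 For any $\tau$, we have $\hat{-} : \mathcal{A}[\![\{\tau\}]\!] \times \mathcal{A}[\![\{\tau\}]\!] \to \mathcal{A}[\![\{\tau\}]\!]$ is color-invariant and $|v_1 \mathbin{\hat{-}} v_2| = |v_1| - |v_2|$.

10 For any $\tau$, we have $\hat{\bigcup} : \mathcal{A}[\![\{\{\tau\}\}]\!] \to \mathcal{A}[\![\{\tau\}]\!]$ is color-invariant and $|\hat{\bigcup} v| = \bigcup |v|$.

Proof. Most cases are immediate. The cases for sum ($\hat{\Sigma}$) and flattening ($\hat{\bigcup}$) rely on the cases
for binary addition and union.

The second part of the case of difference (9) is slightly involved. We reason as follows.

\[
\begin{aligned}
|w_1^{\Phi_1} \mathbin{\hat{-}} w_2^{\Phi_2}| &= |\{v \mid v \in w_1, |v| \notin |w_2|\}^{\|w_1^{\Phi_1}\| \cup \|w_2^{\Phi_2}\|}| = \{|v| \mid v \in w_1, |v| \notin |w_2|\} \\
&= \{v \mid v \in |w_1|, v \notin |w_2|\} = |w_1| - |w_2|
\end{aligned}
\]

□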

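Lemma 4.3. If $\Gamma \vdash e : \tau$ then $\mathcal{P}[\![e]\!] : \mathcal{A}[\![\Gamma]\!] \to \mathcal{A}[\![\tau]\!]$ is color-invariant and $\mathcal{P}[\![e]\!] > \mathcal{E}[\![e]\!]$.

Proof. Proof is by induction on expressions $e$ (which determine the structure of the typing
judgment). Most cases are straightforward, given Lemma 4.2; we show the case of comprehensions.

- Case $e = \{e_2 \mid x \in e_1\}$:

\[
\frac{\Gamma \vdash e_1 : \{\tau_1\} \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \{e_2 \mid x \in e_1\} : \{\tau_2\}}
\]

First, by induction we have $\mathcal{P}[\![e_1]\!] : \mathcal{A}[\![\Gamma]\!] \to \mathcal{A}[\![\{\tau_1\}]\!]$. Hence $w^{\Phi} := \mathcal{P}[\![e_1]\!]\gamma$ is a set of
a-values in $\mathcal{A}[\![\tau_1]\!]$. So for each $v \in \mathcal{P}[\![e_1]\!]\gamma$ we have $\gamma' := \gamma[x := v] \in \mathcal{A}[\![\Gamma, x{:}\tau_1]\!]$, hence
$\mathcal{P}[\![e_2]\!]\gamma' \in \mathcal{A}[\![\tau_2]\!]$ and so

\[
\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma = \{\mathcal{P}[\![e_2]\!]\gamma[x := v] \mid v \in w\}^{\Phi} \in \mathcal{A}[\![\{\tau_2\}]\!]
\]

Furthermore, if $\alpha : \mathrm{color} \to \{\mathrm{color}\}$, then we have

\[
\begin{aligned}
\alpha(\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma) &= \alpha(\{\mathcal{P}[\![e_2]\!](\gamma[x := v]) \mid v \in w\}^{\Phi}) \\
&= \{\alpha(\mathcal{P}[\![e_2]\!](\gamma[x := v])) \mid v \in w\}^{\alpha[\Phi]} \\
&= \{\mathcal{P}[\![e_2]\!](\alpha(\gamma)[x := \alpha(v)]) \mid v \in w\}^{\alpha[\Phi]} \\
&= \{\mathcal{P}[\![e_2]\!](\alpha(\gamma)[x := v']) \mid v' \in \alpha(w)\}^{\alpha[\Phi]} \\
&= \{\mathcal{P}[\![e_2]\!](\alpha(\gamma)[x := v']) \mid v' \in \alpha(w)^{\alpha[\Phi]}\} \\
&= \{\mathcal{P}[\![e_2]\!](\alpha(\gamma)[x := v]) \mid v \in \alpha(\mathcal{P}[\![e_1]\!]\gamma)\} \\
&= \mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\alpha(\gamma)
\end{aligned}
\]

where we appeal to the induction hypothesis to show that $\mathcal{P}[\![e_1]\!]$ and $\mathcal{P}[\![e_2]\!]$ are color-invariant.
Hence $\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]$ is color-invariant.

Second, to show that $\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!] > \mathcal{E}[\![\{e_2 \mid x \in e_1\}]\!]$, we have:

\[
\begin{aligned}
\mathcal{E}[\![\{e_2 \mid x \in e_1\}]\!]|\gamma| &= \{\mathcal{E}[\![e_2]\!](|\gamma|[x := v]) \mid v \in \mathcal{E}[\![e_1]\!]|\gamma|\} = \{\mathcal{E}[\![e_2]\!](|\gamma|[x := v]) \mid v \in |\mathcal{P}[\![e_1]\!]\gamma|\} \\
&= \{\mathcal{E}[\![e_2]\!](|\gamma|[x := v]) \mid v \in |w^{\Phi}|\} = \{\mathcal{E}[\![e_2]\!](|\gamma|[x := |v|]) \mid |v| \in |w|\} \\
&= \{\mathcal{E}[\![e_2]\!](|\gamma|[x := |v|]) \mid v \in w\} = \{\mathcal{E}[\![e_2]\!]|\gamma[x := v]| \mid v \in w\} \\
&= \{|\mathcal{P}[\![e_2]\!](\gamma[x := v])| \mid v \in w\} = |\{\mathcal{P}[\![e_2]\!](\gamma[x := v]) \mid v \in w\}^{\Phi}| \\
&= |\{\mathcal{P}[\![e_2]\!](\gamma[x := v]) \mid v \in w^{\Phi}\}| = |\{\mathcal{P}[\![e_2]\!](\gamma[x := v]) \mid v \in \mathcal{P}[\![e_1]\!]\gamma\}| \\
&= |\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma|
\end{aligned}
\]

□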

We now turn to dependency-correctness. Since $\mathcal{P}[\![e]\!]$ is defined in terms of the special
annotation-propagating operations introduced in Figure 5, we need to show that these operations are
dependency-correct. We first need to establish properties of $\equiv_a$.

Lemma 4.4.

1 If $v \equiv_a v'$ then $a \in \|v\| \iff a \in \|v'\|$.

2 If $a \notin \|v\|$ and $v \equiv_a v'$ then $v = v'$.

3 If $v_1 \equiv_a v_2$ then $v_1^{+\Phi} \equiv_a v_2^{+\Phi}$.

Proof. The first part is easy to establish by induction on derivations of $\equiv_a$, by noting that
$a \in \|v\| \iff a \in \|v'\|$ is equivalent to $\|v\| \cap \{a\} = \|v'\| \cap \{a\}$ and reasoning equationally.
For the second part, note that the rule

\[
\frac{a \in \Phi_1 \cap \Phi_2}{w_1^{\Phi_1} \equiv_a w_2^{\Phi_2}}
\]

can never apply since $a \notin \|w_1^{\Phi_1}\| = \|w_1\| \cup \Phi_1$ implies $a \notin \Phi_1 \cap \Phi_2$. The remaining rules
coincide with the rules for annotated value equality.

For the third part observe that both of the rules defining $\equiv_a$ for annotated values are preserved
by adding equal sets of annotations to both sides. □

We now state a key lemma which shows that all of the lifted operations are dependency-correct.
Many of the arguments are similar. In each case, if we know that the inputs to an operation are
$\equiv_a$, we reason by cases on the structure of the derivation of $\equiv_a$. If any of the assumptions $v \equiv_a v'$
hold because $a \in \Phi \cap \Phi'$ for some pair of inputs $v, v'$, then both outputs will also be annotated
with $a$. Otherwise, the inputs must have the same top-level structure, so in each case we have
enough information to evaluate the unlifted function and show that the results are still $\equiv_a$.

The proofs for equality and difference operations are slightly different. Both operations are
potentially global, that is, changes deep in the input values can affect the top-level structure
of the result (trivially for $\approx$, since there is no deep structure in the boolean result). This is,
essentially, why we need to include all of the annotations of the inputs in the result of an equality
or difference operation. We should point out that this inaccuracy is an area where we believe
improvement may be possible, through refining the definition of $\equiv_a$; but this is left for future
work.

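Lemma 4.5. If $v \equiv_a v'$, $v_1 \equiv_a v_1'$, $v_2 \equiv_a v_2'$ then:

1 $v_1 \mathbin{\hat{+}} v_2 \equiv_a v_1' \mathbin{\hat{+}} v_2'$

2 $\hat{\Sigma} v \equiv_a \hat{\Sigma} v'$

3 $\hat{\neg} v \equiv_a \hat{\neg} v'$

4 $v_1 \mathbin{\hat{\wedge}} v_2 \equiv_a v_1' \mathbin{\hat{\wedge}} v_2'$

5 $\hat{\pi}_i(v) \equiv_a \hat{\pi}_i(v')$

6 $v_1 \mathbin{\hat{\approx}} v_2 \equiv_a v_1' \mathbin{\hat{\approx}} v_2'$

7 $\mathrm{cond}(v, v_1, v_2) \equiv_a \mathrm{cond}(v', v_1', v_2')$

8 $v_1 \mathbin{\hat{\cup}} v_2 \equiv_a v_1' \mathbin{\hat{\cup}} v_2'$

9 $v_1 \mathbin{\hat{-}} v_2 \equiv_a v_1' \mathbin{\hat{-}} v_2'$

10 $\hat{\bigcup} v \equiv_a \hat{\bigcup} v'$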

Proof. For part (1), suppose $v_i = n_i^{\Phi_i}$ and $v_i' = m_i^{\Phi_i'}$ for $i \in \{1, 2\}$. There are four cases,
depending on the derivations of $v_i \equiv_a v_i'$ for $i \in \{1, 2\}$. If both derivations follow because
$n_i^{\Phi_i} = m_i^{\Phi_i'}$ then $v_1 \mathbin{\hat{+}} v_2 = (n_1 + n_2)^{\Phi_1 \cup \Phi_2} = (m_1 + m_2)^{\Phi_1' \cup \Phi_2'} = v_1' \mathbin{\hat{+}} v_2'$, so
$v_1 \mathbin{\hat{+}} v_2 \equiv_a v_1' \mathbin{\hat{+}} v_2'$. Otherwise one or both of the derivations follows because $a \in \Phi_i \cap \Phi_i'$ for $i = 1$ or $i = 2$.
Then $a \in (\Phi_1 \cup \Phi_2) \cap (\Phi_1' \cup \Phi_2')$ so again $v_1 \mathbin{\hat{+}} v_2 \equiv_a v_1' \mathbin{\hat{+}} v_2'$.

For part (2), there are two cases. If the summed sets are $\equiv_a$ because their top-level annotations
mention $a$, then the results of the sums will also mention $a$, so we are done. Otherwise, we must
have that the summed sets are of equal size and their elements are pairwise matched by $\equiv_a$;
hence, we can apply part (1) repeatedly (and then Lemma 4.4) to show that the results are $\equiv_a$.
Parts (3,4) are similar to part (1).

For part (5), suppose $v = (v_1, v_2)^{\Phi}$ and $v' = (v_1', v_2')^{\Phi'}$. Note that

\[
\hat{\pi}_i(v) = \hat{\pi}_i((v_1, v_2)^{\Phi}) = v_i^{+\Phi}
\]

and similarly $\hat{\pi}_i(v') = (v_i')^{+\Phi'}$. There are two cases depending on the last step in the derivation
of $v \equiv_a v'$. If $a \in \Phi \cap \Phi'$ then we are done since $a$ will be in the top-level annotations of both
$v_i^{+\Phi}$ and $(v_i')^{+\Phi'}$. Otherwise we must have $\Phi = \Phi'$ and $v_i \equiv_a v_i'$ for $i \in \{1, 2\}$, so again
$v_i^{+\Phi} \equiv_a (v_i')^{+\Phi'}$.

For part (6), there are two cases. If $a \in (\|v_1\| \cup \|v_2\|) \cap (\|v_1'\| \cup \|v_2'\|)$ then we are done.
Otherwise by Lemma 4.4, $a$ cannot appear anywhere in $v_1, v_2, v_1', v_2'$, so we must have $v_1 =
v_1'$, $v_2 = v_2'$. Hence $(v_1 \mathbin{\hat{\approx}} v_2) = (v_1' \mathbin{\hat{\approx}} v_2')$, which implies the two sides are $\equiv_a$ as well.

For part (7), suppose $v = b^{\Phi}$, $v' = (b')^{\Phi'}$. If $a \in \Phi \cap \Phi'$ then we are done since both
conditionals will have $a$ in their top-level annotation. Otherwise we must have $b = b'$ and $\Phi = \Phi'$, so
$\mathrm{cond}(v, v_1, v_2) = v_i^{+\Phi}$ and $\mathrm{cond}(v', v_1', v_2') = (v_i')^{+\Phi}$ for the same $i$, so by assumption (and
Lemma 4.4) we are done.

For part (8), suppose $v_i = w_i^{\Phi_i}$ and similarly for $v_i'$. Again if $a \in (\Phi_1 \cup \Phi_2) \cap (\Phi_1' \cup \Phi_2')$
then we are done. Otherwise we must have that $w_1 = \{v_{11}, \ldots, v_{1n}\}$, $w_1' = \{v_{11}', \ldots, v_{1n}'\}$
where $v_{1i} \equiv_a v_{1i}'$ for each $i \in \{1, \ldots, n\}$, and similarly for $w_2, w_2'$. Hence the elements of the
union of the two multisets can be matched up using the $\equiv_a$ relation, so we can conclude that
$w_1 \cup w_2 \equiv_a w_1' \cup w_2'$ as well. We must also have $\Phi_i = \Phi_i'$ for each $i \in \{1, 2\}$, so we can
conclude that

\[
v_1 \mathbin{\hat{\cup}} v_2 = (w_1 \cup w_2)^{\Phi_1 \cup \Phi_2} \equiv_a (w_1' \cup w_2')^{\Phi_1' \cup \Phi_2'} = v_1' \mathbin{\hat{\cup}} v_2'
\]

For part (9), the reasoning is similar to part (6).

For part (10), the reasoning is similar to that for part (2), appealing to part (8) once we have
expanded to binary unions. □

We conclude the section with the proof of dependency-correctness. It is much simplified by 
the previous lemma, since many cases now consist only of applying the induction hypothesis and 
then using dependency-correctness of a lifted operation. 

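Theorem 4.1. If $\Gamma \vdash e : \tau$ then $\mathcal{P}[\![e]\!]$ is dependency-correct.

Proof. Suppose $\gamma \equiv_a \gamma'$. Again proof is by induction on the structure of expressions/typing
derivations. Many cases are immediate using the induction hypothesis and the corresponding
parts of Lemma 4.5. We show the remaining cases:

- Case $e = x$:

\[
\frac{x{:}\tau \in \Gamma}{\Gamma \vdash x : \tau}
\]

By assumption $\mathcal{P}[\![x]\!]\gamma = \gamma(x) \equiv_a \gamma'(x) = \mathcal{P}[\![x]\!]\gamma'$.

- Case $e = \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2$:

\[
\frac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 : \tau_2}
\]

By induction $\mathcal{P}[\![e_1]\!]$ and $\mathcal{P}[\![e_2]\!]$ are dependency-correct. Hence $\mathcal{P}[\![e_1]\!]\gamma \equiv_a \mathcal{P}[\![e_1]\!]\gamma'$, so
$\gamma[x := \mathcal{P}[\![e_1]\!]\gamma] \equiv_a \gamma'[x := \mathcal{P}[\![e_1]\!]\gamma']$. It then follows by induction that
$\mathcal{P}[\![e]\!]\gamma = \mathcal{P}[\![e_2]\!](\gamma[x := \mathcal{P}[\![e_1]\!]\gamma]) \equiv_a \mathcal{P}[\![e_2]\!](\gamma'[x := \mathcal{P}[\![e_1]\!]\gamma']) = \mathcal{P}[\![e]\!]\gamma'$.

- Case $e = (e_1, e_2)$:

\[
\frac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma \vdash e_2 : \tau_2}{\Gamma \vdash (e_1, e_2) : \tau_1 \times \tau_2}
\]

By induction, $\mathcal{P}[\![e_1]\!]$ and $\mathcal{P}[\![e_2]\!]$ are dependency-correct, so $v_i = \mathcal{P}[\![e_i]\!]\gamma \equiv_a \mathcal{P}[\![e_i]\!]\gamma' = v_i'$
for $i \in \{1, 2\}$. Hence we can immediately derive $(v_1, v_2)^{\emptyset} \equiv_a (v_1', v_2')^{\emptyset}$.

- Case $e = \{e'\}$:

\[
\frac{\Gamma \vdash e' : \tau}{\Gamma \vdash \{e'\} : \{\tau\}}
\]

By induction, $\mathcal{P}[\![e']\!]$ is dependency-correct, so $v = \mathcal{P}[\![e']\!]\gamma \equiv_a \mathcal{P}[\![e']\!]\gamma' = v'$. Hence we can
immediately derive $\{v\}^{\emptyset} \equiv_a \{v'\}^{\emptyset}$.

- Case $e = \{e_2 \mid x \in e_1\}$:

\[
\frac{\Gamma \vdash e_1 : \{\tau_1\} \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \{e_2 \mid x \in e_1\} : \{\tau_2\}}
\]

By induction, $\mathcal{P}[\![e_1]\!]$ and $\mathcal{P}[\![e_2]\!]$ are dependency-correct. Hence $w_1^{\Phi_1} = \mathcal{P}[\![e_1]\!]\gamma \equiv_a \mathcal{P}[\![e_1]\!]\gamma' =
(w_1')^{\Phi_1'}$. There are two cases. If $a \in \Phi_1 \cap \Phi_1'$ then we are done since $\mathcal{P}[\![e]\!]\gamma$ and $\mathcal{P}[\![e]\!]\gamma'$ will
both contain top-level annotations $a$. Otherwise, we must have

\[
\frac{\dfrac{v_{11} \equiv_a v_{11}' \;\cdots\; v_{1n} \equiv_a v_{1n}'}{w_1 \equiv_a w_1'} \quad \Phi_1 = \Phi_1'}{w_1^{\Phi_1} \equiv_a (w_1')^{\Phi_1'}}
\]

where $w_1 = \{v_{11}, \ldots, v_{1n}\}$ and similarly for $w_1'$. Thus, for each $i \in \{1, \ldots, n\}$, we have
$\gamma[x := v_{1i}] \equiv_a \gamma'[x := v_{1i}']$. It follows that for some $v_{2i}$ and $v_{2i}'$ we have $v_{2i} = \mathcal{P}[\![e_2]\!](\gamma[x :=
v_{1i}]) \equiv_a \mathcal{P}[\![e_2]\!](\gamma'[x := v_{1i}']) = v_{2i}'$ for each $i \in \{1, \ldots, n\}$. Thus, we can derive

\[
\frac{\dfrac{v_{21} \equiv_a v_{21}' \;\cdots\; v_{2n} \equiv_a v_{2n}'}{w_2 \equiv_a w_2'} \quad \Phi_1 = \Phi_1'}{w_2^{\Phi_1} \equiv_a (w_2')^{\Phi_1'}}
\]

where

\[
w_2^{\Phi_1} = \{\mathcal{P}[\![e_2]\!](\gamma[x := v]) \mid v \in w_1\}^{\Phi_1} = \mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma
\]

and similarly

\[
(w_2')^{\Phi_1'} = \{\mathcal{P}[\![e_2]\!](\gamma'[x := v]) \mid v \in w_1'\}^{\Phi_1'} = \mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma'
\]

So we can conclude that $\mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma \equiv_a \mathcal{P}[\![\{e_2 \mid x \in e_1\}]\!]\gamma'$.
This exhausts all cases and completes the proof. □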



5. Static Provenance Analysis 

Dynamic provenance may be expensive to compute and nontrivial to implement in a standard
relational database system. Moreover, dynamic analysis cannot tell us anything about a query
without looking at (annotated) input data. In a typical large database, most of the data is in
secondary storage, so it is worthwhile to be able to avoid data access whenever possible. Moreover,
even if we want to perform dynamic provenance tracking, a static approximation of dependency
information may be useful for optimization. In this section we consider a static provenance
analysis which statically approximates the dynamic provenance, but can be calculated quickly
without accessing the input.



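We formulate the analysis as a type-based analysis (Palsberg, 2001); annotated types (a-types)
$\hat{\tau}$ and raw types (r-types) $\omega$ are defined as follows:

\[
\hat{\tau} ::= \omega^{\Phi} \qquad \omega ::= \mathsf{int} \mid \mathsf{bool} \mid \hat{\tau}_1 \times \hat{\tau}_2 \mid \{\hat{\tau}\}
\]

We write $\hat{\Gamma}$ for a typing context mapping variables to a-types. We lift the auxiliary a-value
operations of erasure ($|\hat{\tau}|$) and annotation extraction ($\|\hat{\tau}\|$) to a-types as follows:

\[
\begin{aligned}
|\mathsf{int}| &= \mathsf{int} & \|\mathsf{int}\| &= \emptyset \\
|\mathsf{bool}| &= \mathsf{bool} & \|\mathsf{bool}\| &= \emptyset \\
|\hat{\tau}_1 \times \hat{\tau}_2| &= |\hat{\tau}_1| \times |\hat{\tau}_2| & \|\hat{\tau}_1 \times \hat{\tau}_2\| &= \|\hat{\tau}_1\| \cup \|\hat{\tau}_2\| \\
|\{\hat{\tau}\}| &= \{|\hat{\tau}|\} & \|\{\hat{\tau}\}\| &= \|\hat{\tau}\| \\
|\omega^{\Phi}| &= |\omega| & \|\omega^{\Phi}\| &= \|\omega\| \cup \Phi
\end{aligned}
\]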



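Moreover, we define compatibility for a-types analogously to compatibility for values, that is,
$\hat{\tau}_1$ and $\hat{\tau}_2$ are compatible ($\hat{\tau}_1 \cong \hat{\tau}_2$) provided $|\hat{\tau}_1| = |\hat{\tau}_2|$. Also, we say that an a-type $\hat{\tau}$ enriches
an ordinary type $\tau$ (written $\hat{\tau} > \tau$) provided $|\hat{\tau}| = \tau$. These concepts are lifted to a-contexts $\hat{\Gamma}$
mapping variables to types in the obvious (pointwise) way.

We also define a merge operation $\sqcup$ on compatible types as follows:

\[
\begin{aligned}
\mathsf{int} \sqcup \mathsf{int} &= \mathsf{int} \\
\mathsf{bool} \sqcup \mathsf{bool} &= \mathsf{bool} \\
(\hat{\tau}_1 \times \hat{\tau}_2) \sqcup (\hat{\tau}_1' \times \hat{\tau}_2') &= (\hat{\tau}_1 \sqcup \hat{\tau}_1') \times (\hat{\tau}_2 \sqcup \hat{\tau}_2') \\
\{\hat{\tau}\} \sqcup \{\hat{\tau}'\} &= \{\hat{\tau} \sqcup \hat{\tau}'\} \\
\omega_1^{\Phi_1} \sqcup \omega_2^{\Phi_2} &= (\omega_1 \sqcup \omega_2)^{\Phi_1 \cup \Phi_2}
\end{aligned}
\]

Finally, we write $\hat{\tau} \sqsubseteq \hat{\tau}'$ if $\hat{\tau}' = \hat{\tau} \sqcup \hat{\tau}'$; this is a partial order on types and can be viewed as a
subtyping relation.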

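We interpret a-types $\hat{\tau}$ as sets of a-values $\mathcal{A}[\![\hat{\tau}]\!]$. We interpret the annotations in a-types as
upper bounds on the annotations in the corresponding a-values:

\[
\begin{aligned}
\mathcal{A}[\![\mathsf{int}]\!] &= \{i \mid i \in \mathbb{Z}\} \\
\mathcal{A}[\![\mathsf{bool}]\!] &= \{b \mid b \in \mathbb{B}\} \\
\mathcal{A}[\![\hat{\tau}_1 \times \hat{\tau}_2]\!] &= \mathcal{A}[\![\hat{\tau}_1]\!] \times \mathcal{A}[\![\hat{\tau}_2]\!] \\
\mathcal{A}[\![\{\hat{\tau}\}]\!] &= \mathcal{M}_{\mathrm{fin}}(\mathcal{A}[\![\hat{\tau}]\!]) \\
\mathcal{A}[\![\omega^{\Phi}]\!] &= \{w^{\Psi} \mid \Psi \subseteq \Phi, w \in \mathcal{A}[\![\omega]\!]\}
\end{aligned}
\]

The syntactic operations $|-|$, $\|-\|$, $\sqsubseteq$ and $\sqcup$ on types correspond to appropriate semantic
operations on sets of a-values. We note some useful properties of these operations: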

Lemma 5.1. 

I If w G Al?l theniJ G ^||f|l and \v\ G r||f|] and ||w|| C ||f||. 
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x:t £ f f h ei : Ti T, a: : ri h 62 : r2 

r \- X : T r h let s: = ei in e2 : T2 

f h ei : int fc $1 F h 62 : int fc $2 f h e : {int'^°} fc ^ 

fl-i:int&0 r h ei + e2 : int & $1 U $2 F h sum(e) : int & $0 U $ 

f h e : bool fc $ f h ei : bool fc $1 f h 62 : bool fc $2 

f\-b: bool & f h -.e : bool & <1> f h ei A £2 : bool & $1 U $2 

f I- ei : n f h 62 : ?2 f h e : ^ x u;p fc $ 

(«G{1.2|) 

r I- (61,62) : (n X ?2) &0 rh 7ri(6) : oJi & $i U$ 

f I- 61 : ri f h 62 : r2 ri r2 f h ep : bool fc $0 f h 61 : fi f h 62 : r2 ri ^ r2 

r h 61 « 62 : bool & ||ri|| U ||r2|| f h if 60 then 61 else 62 : (n U T2)+*'' 

fhe:? f h 61 : {n} fc $1 f h 62 : {?2} & $2 ?! 

fh0:{?}&0 f|-{6}:{?}&0 r h 61 U62 : {?i U?2} & $1 U$2 

f h 61 : {?i} & $1 r,x:n h 62 : w & $2 F h 6 : {{f}*^ } & $1 

f h {62 I e 61} ; & $1 f h Ue : {?} & $1 U $2 

fF6i:{n}&$i f h 62 : {?2} & "I'2 n=r2 

f h ei - 62 : {n} & ||{n}*i || U ||{?2}*^ || 

Fig. 8. Type-based static provenance analysis 



2 If?i ^ ?2 then fi Ufa is defined and ^pi Ufa] 3 ^IfilU^pal and UnUfa || = ||fi||U||f2||. 

3 If fi □ fa then Affij C Ipa] and ||fi|| C ||f2||. 

Figure 8 shows the annotated typing judgment $\hat{\Gamma} \vdash e : \hat{\tau}$ (sometimes written $\hat{\Gamma} \vdash e : \omega \,\&\, \Phi$
for readability, provided $\hat{\tau} = \omega^{\Phi}$), which extends the plain typing judgment shown in Figure 2.

Proposition 5.1. The judgment $\Gamma \vdash e : \tau$ is derivable if and only if for any $\hat{\Gamma} > \Gamma$, there exists
a $\hat{\tau} > \tau$ such that $\hat{\Gamma} \vdash e : \hat{\tau}$. Moreover, given $\Gamma \vdash e : \tau$ and $\hat{\Gamma} > \Gamma$, we can compute $\hat{\tau}$ in
polynomial time (by a simple syntax-directed algorithm).
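A syntax-directed algorithm of this kind can be sketched as a recursive function over expressions. The code below is an illustration under several assumptions, not the analysis as implemented by the authors: a-types are encoded as pairs `(raw, colors)` with `raw` one of `"int"`, `"bool"`, `("pair", t1, t2)` or `("set", t)`; expressions are tagged tuples; only a subset of the constructs is covered; and the empty-set expression carries its element type explicitly so that the analysis stays syntax-directed.

```python
E = frozenset()

def merge(t1, t2):
    """Merge two compatible a-types, taking unions of annotations pointwise."""
    (r1, a1), (r2, a2) = t1, t2
    if isinstance(r1, str):
        assert r1 == r2, "incompatible base types"
        r = r1
    elif r1[0] == "pair":
        r = ("pair", merge(r1[1], r2[1]), merge(r1[2], r2[2]))
    else:
        r = ("set", merge(r1[1], r2[1]))
    return (r, a1 | a2)

def colors(t):
    """All colors occurring anywhere in an a-type."""
    r, a = t
    out = set(a)
    if not isinstance(r, str):
        for s in r[1:]:
            out |= colors(s)
    return frozenset(out)

def analyze(env, e):
    """Compute an a-type for expression e in annotated context env."""
    tag = e[0]
    if tag == "var":
        return env[e[1]]
    if tag == "int":
        return ("int", E)
    if tag == "plus":
        (_, a1), (_, a2) = analyze(env, e[1]), analyze(env, e[2])
        return ("int", a1 | a2)
    if tag == "sum":                      # bag of ints: fuse element and bag colors
        r, a = analyze(env, e[1])
        return ("int", r[1][1] | a)
    if tag == "pair":
        return (("pair", analyze(env, e[1]), analyze(env, e[2])), E)
    if tag == "proj":                     # ("proj", i, e) with i in {1, 2}
        r, a = analyze(env, e[2])
        ri, ai = r[e[1]]
        return (ri, ai | a)
    if tag == "eq":
        return ("bool", colors(analyze(env, e[1])) | colors(analyze(env, e[2])))
    if tag == "if":
        _, a0 = analyze(env, e[1])
        r, a = merge(analyze(env, e[2]), analyze(env, e[3]))
        return (r, a | a0)
    if tag == "empty":                    # ("empty", element a-type), an assumption
        return (("set", e[1]), E)
    if tag == "sing":
        return (("set", analyze(env, e[1])), E)
    if tag == "flatten":
        r, a1 = analyze(env, e[1])
        inner, a2 = r[1]
        return (inner, a1 | a2)
    if tag == "comp":                     # ("comp", body, x, source)
        r, a1 = analyze(env, e[3])
        return (("set", analyze({**env, e[2]: r[1]}, e[1])), a1)
    raise ValueError(f"unsupported expression: {tag}")

a, b = frozenset("a"), frozenset("b")
tR = (("pair", ("int", a), ("int", b)), E)
env = {"R": (("set", tR), E)}
x, y, R = ("var", "x"), ("var", "y"), ("var", "R")
G = ("flatten", ("comp",
     ("if", ("eq", ("proj", 1, y), ("proj", 1, x)),
      ("sing", ("proj", 2, y)), ("empty", ("int", E))), "y", R))
Q = ("comp", ("pair", ("proj", 1, x), ("sum", G)), "x", R)
```

Running `analyze(env, Q)` on this grouping-and-aggregation query, with the two fields of `R` colored `a` and `b`, yields a set of pairs whose first component carries `a` and whose second carries both `a` and `b`. Each subexpression is visited once, so the running time is polynomial in the size of the expression and its types.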

Example 5.1. Consider an annotated type context $\hat{\Gamma}$, shown in Figure 9(a), where we have
annotated field values $A, B, C, D, E$ with colors $a, b, c, d, e$ respectively. Figure 9(b) shows the
results of static analysis for the queries in Figure 7. In some cases, the type information simply
reflects the field names which are present in the output. However, the colors are not affected by
renamings, as in $\rho_{A/C,B/D}$. Furthermore, note that (if we replace the colors $a, b, c, d, e$ with color
sets $\{a_1, a_2, a_3\}$, etc.) in each case the type-level colors safely over-approximate the value-level
colors calculated in Figure 7.



Example 5.2. To further illustrate the analysis, we consider an extended example for a query
that performs grouping and aggregation (equivalent to the one in Example 4.2):

\[ Q(R) = \{(\pi_1(x), \mathsf{sum}(G(x))) \mid x \in R\} \]
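(a)

\[ \hat\Gamma = [R : \{(A : \mathsf{int}^{a}, B : \mathsf{int}^{b})\},\ S : \{(C : \mathsf{int}^{c}, D : \mathsf{int}^{d}, E : \mathsf{int}^{e})\}] \]

(b)

\[
\begin{array}{l}
\hat\Gamma \vdash \Pi_A(R) : \{(A : \mathsf{int}^{a})\} \\
\hat\Gamma \vdash \sigma_{A=B}(R) : \{(A : \mathsf{int}^{a}, B : \mathsf{int}^{b})\}^{a,b} \\
\hat\Gamma \vdash R \times S : \{(A : \mathsf{int}^{a}, B : \mathsf{int}^{b}, C : \mathsf{int}^{c}, D : \mathsf{int}^{d}, E : \mathsf{int}^{e})\} \\
\hat\Gamma \vdash R \cup \rho_{A/C,B/D}(\Pi_{CD}(S)) : \{(A : \mathsf{int}^{a,c}, B : \mathsf{int}^{b,d})\} \\
\hat\Gamma \vdash R - \rho_{A/D,B/E}(\Pi_{DE}(S)) : \{(A : \mathsf{int}^{a}, B : \mathsf{int}^{b})\}^{a,b,d,e} \\
\hat\Gamma \vdash \mathsf{sum}(\Pi_A(R)) : \mathsf{int}^{a} \\
\hat\Gamma \vdash \mathsf{count}(R) : \mathsf{int} \\
\hat\Gamma \vdash \mathsf{count}(\sigma_{A=B}(R)) : \mathsf{int}^{a,b}
\end{array}
\]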

Fig. 9. (a) Annotated input context (b) Examples of provenance analysis 



where we employ the following abbreviations: 



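\[
\begin{array}{rcl}
G(x) & := & \bigcup\{\mathsf{if}\ \pi_1(y) \approx \pi_1(x)\ \mathsf{then}\ \{\pi_2(y)\}\ \mathsf{else}\ \emptyset \mid y \in R\} \\
\hat\tau_R & := & \mathsf{int}^{a} \times \mathsf{int}^{b} \\
\hat\Gamma & := & R : \{\hat\tau_R\} \\
\hat\Gamma_1 & := & \hat\Gamma, x : \hat\tau_R \\
\hat\Gamma_2 & := & \hat\Gamma_1, y : \hat\tau_R
\end{array}
\]

We will derive $\hat\Gamma \vdash Q(R) : \{\mathsf{int}^{a} \times \mathsf{int}^{a,b}\}$. The derivation illustrates how color $a$ is propagated
to both parts of the result type, while color $b$ is only propagated to the second column.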
First, we can reduce the analysis of Q to analyzing G{x) as follows: 

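\[
\dfrac{\hat\Gamma \vdash R : \{\hat\tau_R\}
       \quad
       \dfrac{\dfrac{\hat\Gamma_1 \vdash x : \hat\tau_R}{\hat\Gamma_1 \vdash \pi_1(x) : \mathsf{int}^{a}}
              \quad
              \dfrac{\hat\Gamma_1 \vdash G(x) : \{\mathsf{int}^{b}\}^{a}}{\hat\Gamma_1 \vdash \mathsf{sum}(G(x)) : \mathsf{int}^{a,b}}}
             {\hat\Gamma_1 \vdash (\pi_1(x), \mathsf{sum}(G(x))) : \mathsf{int}^{a} \times \mathsf{int}^{a,b}}}
      {\hat\Gamma \vdash \{(\pi_1(x), \mathsf{sum}(G(x))) \mid x \in R\} : \{\mathsf{int}^{a} \times \mathsf{int}^{a,b}\}}
\]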

We next reduce the analysis of G{x) to an analysis of the conditional inside G{x): 

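\[
\dfrac{\dfrac{\hat\Gamma_1 \vdash R : \{\hat\tau_R\}
              \quad
              \hat\Gamma_2 \vdash \mathsf{if}\ \pi_1(y) \approx \pi_1(x)\ \mathsf{then}\ \{\pi_2(y)\}\ \mathsf{else}\ \emptyset : \{\mathsf{int}^{b}\}^{a}}
             {\hat\Gamma_1 \vdash \{\mathsf{if}\ \pi_1(y) \approx \pi_1(x)\ \mathsf{then}\ \{\pi_2(y)\}\ \mathsf{else}\ \emptyset \mid y \in R\} : \{\{\mathsf{int}^{b}\}^{a}\}}}
      {\hat\Gamma_1 \vdash \bigcup\{\mathsf{if}\ \pi_1(y) \approx \pi_1(x)\ \mathsf{then}\ \{\pi_2(y)\}\ \mathsf{else}\ \emptyset \mid y \in R\} : \{\mathsf{int}^{b}\}^{a}}
\]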
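Finally, we can analyze the conditional as follows:

\[
\dfrac{\dfrac{\dfrac{\hat\Gamma_2 \vdash y : \hat\tau_R}{\hat\Gamma_2 \vdash \pi_1(y) : \mathsf{int}^{a}}
              \quad
              \dfrac{\hat\Gamma_2 \vdash x : \hat\tau_R}{\hat\Gamma_2 \vdash \pi_1(x) : \mathsf{int}^{a}}}
             {\hat\Gamma_2 \vdash \pi_1(y) \approx \pi_1(x) : \mathsf{bool}^{a}}
       \quad
       \dfrac{\dfrac{\hat\Gamma_2 \vdash y : \hat\tau_R}{\hat\Gamma_2 \vdash \pi_2(y) : \mathsf{int}^{b}}}
             {\hat\Gamma_2 \vdash \{\pi_2(y)\} : \{\mathsf{int}^{b}\}}
       \quad
       \hat\Gamma_2 \vdash \emptyset : \{\mathsf{int}\}}
      {\hat\Gamma_2 \vdash \mathsf{if}\ \pi_1(y) \approx \pi_1(x)\ \mathsf{then}\ \{\pi_2(y)\}\ \mathsf{else}\ \emptyset : \{\mathsf{int}^{b}\}^{a}}
\]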
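The three derivations above can also be replayed mechanically. The following Python sketch is a hypothetical, syntax-directed reading of the rules in Figure 8, not the authors' implementation: expressions are encoded as nested tuples, the empty set carries its element a-type explicitly (the rule for the empty set otherwise leaves it unconstrained), and compatibility is assumed to mean "same shape ignoring colors".

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class AType:
    ctor: str                            # "int", "bool", "pair" or "set"
    args: Tuple["AType", ...] = ()
    colors: FrozenSet[str] = frozenset()


def int_(*cs): return AType("int", (), frozenset(cs))
def bool_(*cs): return AType("bool", (), frozenset(cs))
def pair(t1, t2, *cs): return AType("pair", (t1, t2), frozenset(cs))
def set_(t, *cs): return AType("set", (t,), frozenset(cs))


def join(t1, t2):
    """Upper bound of compatible a-types; raises if the shapes differ."""
    if t1.ctor != t2.ctor or len(t1.args) != len(t2.args):
        raise TypeError("incompatible a-types")
    return AType(t1.ctor, tuple(join(a, b) for a, b in zip(t1.args, t2.args)),
                 t1.colors | t2.colors)


def all_colors(t):
    """All colors occurring anywhere in t."""
    return t.colors.union(*(all_colors(a) for a in t.args))


def plus(t, phi):
    """Add the colors phi to the outermost annotation of t."""
    return replace(t, colors=t.colors | frozenset(phi))


def expect(t, ctor):
    if t.ctor != ctor:
        raise TypeError(f"expected {ctor}, got {t.ctor}")
    return t


def analyze(env, e):
    """Compute the a-type of expression e in the annotated context env."""
    op = e[0]
    if op == "var":
        return env[e[1]]
    if op == "let":                      # ("let", x, e1, e2)
        return analyze({**env, e[1]: analyze(env, e[2])}, e[3])
    if op == "int":
        return int_()
    if op == "add":
        t1, t2 = expect(analyze(env, e[1]), "int"), expect(analyze(env, e[2]), "int")
        return int_(*(t1.colors | t2.colors))
    if op == "sum":                      # element colors plus set colors
        t = expect(analyze(env, e[1]), "set")
        return int_(*(expect(t.args[0], "int").colors | t.colors))
    if op == "pair":
        return pair(analyze(env, e[1]), analyze(env, e[2]))
    if op == "proj":                     # ("proj", i, e) with i in {1, 2}
        t = expect(analyze(env, e[2]), "pair")
        return plus(t.args[e[1] - 1], t.colors)
    if op == "eq":
        t1, t2 = analyze(env, e[1]), analyze(env, e[2])
        join(t1, t2)                     # compatibility side condition
        return bool_(*(all_colors(t1) | all_colors(t2)))
    if op == "if":
        t0 = expect(analyze(env, e[1]), "bool")
        return plus(join(analyze(env, e[2]), analyze(env, e[3])), t0.colors)
    if op == "empty":                    # ("empty", element a-type)
        return set_(e[1])
    if op == "single":
        return set_(analyze(env, e[1]))
    if op == "union":
        return join(expect(analyze(env, e[1]), "set"), expect(analyze(env, e[2]), "set"))
    if op == "diff":
        t1, t2 = expect(analyze(env, e[1]), "set"), expect(analyze(env, e[2]), "set")
        join(t1, t2)                     # compatibility side condition
        return replace(t1, colors=all_colors(t1) | all_colors(t2))
    if op == "for":                      # ("for", x, e1, e2) encodes {e2 | x in e1}
        t1 = expect(analyze(env, e[2]), "set")
        return set_(analyze({**env, e[1]: t1.args[0]}, e[3]), *t1.colors)
    if op == "flatten":                  # inner set colors plus outer set colors
        t = expect(analyze(env, e[1]), "set")
        return plus(expect(t.args[0], "set"), t.colors)
    raise ValueError(f"unknown expression {op!r}")


# Example 5.2: Q(R) = {(pi_1(x), sum(G(x))) | x in R}
env = {"R": set_(pair(int_("a"), int_("b")))}
x, y, R = ("var", "x"), ("var", "y"), ("var", "R")
cond = ("if", ("eq", ("proj", 1, y), ("proj", 1, x)),
        ("single", ("proj", 2, y)), ("empty", int_()))
G = ("flatten", ("for", "y", R, cond))
Q = ("for", "x", R, ("pair", ("proj", 1, x), ("sum", G)))
result = analyze(env, Q)
print(result)
```

Under these assumptions the computed type for `Q` is a set of pairs whose first component carries color `a` and whose second carries colors `a` and `b`, in agreement with the derivation. Each expression node is visited once, which is the sense in which the analysis is syntax-directed.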

5.1. Correctness of static analysis 

The correctness of the analysis is proved with respect to the provenance-tracking semantics given
in Section 4, which we have already shown dependency-correct. Correctness is formulated as
a type-soundness theorem, using the refined interpretation $\mathcal{A}[\![-]\!]$ of a-types. Specifically, we
show that if $\hat\Gamma \vdash e : \hat\tau$ then $\mathcal{P}[\![e]\!] : \mathcal{A}[\![\hat\Gamma]\!] \to \mathcal{A}[\![\hat\tau]\!]$. Theorem 5.2 immediately implies that the
annotations we obtain (statically) by provenance analysis conservatively over-approximate the
dependency-correct annotations we obtain (dynamically) by provenance tracking, provided the
initial value $\hat\gamma$ matches $\mathcal{A}[\![\hat\Gamma]\!]$.

We first establish that the static analysis is a conservative extension of the ordinary type sys- 
tem: 

Lemma 5.2. If $\Gamma \vdash e : \tau$ then for any $\hat\Gamma > \Gamma$ there exists a $\hat\tau > \tau$ such that $\hat\Gamma \vdash e : \hat\tau$.

Proof. Structural induction on derivations; again the only interesting steps are those involving
compatibility side-conditions; typically we only need to observe that if $\hat\tau_1, \hat\tau_2 > \tau$ then $\hat\tau_1 \cong \hat\tau_2$,
so $\hat\tau_1 \sqcup \hat\tau_2$ exists and $|\hat\tau_1| = |\hat\tau_2| = |\hat\tau_1 \sqcup \hat\tau_2|$. □

Lemma 5.3. If $\hat\Gamma \vdash e : \hat\tau$ then $|\hat\Gamma| \vdash e : |\hat\tau|$.

Proof. Straightforward induction on derivations; cases with compatibility side-conditions
require observing that by definition $\hat\tau_1 \cong \hat\tau_2$ implies $|\hat\tau_1| = |\hat\tau_2|$. □

Lemma 5.4. Every context $\Gamma$ has at least one enrichment $\hat\Gamma > \Gamma$.

Proof. Observe that any type can be lifted to an a-type by annotating each part of it with $\emptyset$.
An unannotated context $\Gamma$ can be lifted to a default annotated context $\hat\Gamma$ by lifting each type. □

Theorem 5.1. The judgment $\Gamma \vdash e : \tau$ is derivable if and only if for any $\hat\Gamma$ enriching $\Gamma$, there
exists a $\hat\tau$ enriching $\tau$ such that $\hat\Gamma \vdash e : \hat\tau$ is derivable.

Proof. For the forward direction, we use Lemma 5.2. For the reverse direction, suppose the
second part holds for a given $\Gamma, e, \tau$. By Lemma 5.4, we have $\hat\Gamma \vdash e : \hat\tau$ for some $\hat\Gamma$ enriching $\Gamma$
and $\hat\tau$ enriching $\tau$. Hence by Lemma 5.3 we have $|\hat\Gamma| \vdash e : |\hat\tau|$, but clearly $|\hat\Gamma| = \Gamma$ and $|\hat\tau| = \tau$.

□
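The two constructions these proofs rely on, lifting a plain type to a default a-type and erasing annotations, are simple enough to sketch directly. The Python below is illustrative only; the `Type` and `AType` classes and the function names are hypothetical, and "enriches" is read as "erases to", following the observation that erasing an enrichment recovers the original type.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Type:
    """A plain (unannotated) type."""
    ctor: str                            # "int", "bool", "pair" or "set"
    args: Tuple["Type", ...] = ()


@dataclass(frozen=True)
class AType:
    """An annotated type: the same shape with a color set at every position."""
    ctor: str
    args: Tuple["AType", ...] = ()
    colors: FrozenSet[str] = frozenset()


def lift(t: Type) -> AType:
    """Default enrichment: annotate every part of t with the empty color set."""
    return AType(t.ctor, tuple(lift(a) for a in t.args), frozenset())


def erase(t: AType) -> Type:
    """Erasure: forget every annotation."""
    return Type(t.ctor, tuple(erase(a) for a in t.args))


def enriches(annotated: AType, plain: Type) -> bool:
    """Assumed reading: an a-type enriches exactly the type it erases to."""
    return erase(annotated) == plain


plain = Type("set", (Type("pair", (Type("int"), Type("int"))),))
colored = AType("set", (AType("pair", (AType("int", (), frozenset({"a"})),
                                        AType("int", (), frozenset({"b"})))),))
print(enriches(lift(plain), plain), enriches(colored, plain))
```

Both the default lifting and the colored type erase to the same plain type, so both count as enrichments of it under this reading.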

We next establish useful properties of the a-value operations with respect to the semantics of 
annotated types: 

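Lemma 5.5. For any $\Phi, \Phi_1, \Phi_2, \hat\tau, \hat\tau_1, \hat\tau_2$:

1 $\hat{+} : \mathcal{A}[\![\mathsf{int}^{\Phi_1}]\!] \times \mathcal{A}[\![\mathsf{int}^{\Phi_2}]\!] \to \mathcal{A}[\![\mathsf{int}^{\Phi_1 \cup \Phi_2}]\!]$.

2 $\hat{\Sigma} : \mathcal{A}[\![\{\mathsf{int}^{\Phi_1}\}^{\Phi_2}]\!] \to \mathcal{A}[\![\mathsf{int}^{\Phi_1 \cup \Phi_2}]\!]$.

3 $\hat{\neg} : \mathcal{A}[\![\mathsf{bool}^{\Phi}]\!] \to \mathcal{A}[\![\mathsf{bool}^{\Phi}]\!]$.

4 $\hat{\wedge} : \mathcal{A}[\![\mathsf{bool}^{\Phi_1}]\!] \times \mathcal{A}[\![\mathsf{bool}^{\Phi_2}]\!] \to \mathcal{A}[\![\mathsf{bool}^{\Phi_1 \cup \Phi_2}]\!]$.

5 $\hat{\pi}_i : \mathcal{A}[\![(\hat\tau_1 \times \hat\tau_2)^{\Phi}]\!] \to \mathcal{A}[\![\hat\tau_i^{+\Phi}]\!]$ for any $i \in \{1, 2\}$.

6 $\hat{\approx} : \mathcal{A}[\![\hat\tau_1]\!] \times \mathcal{A}[\![\hat\tau_2]\!] \to \mathcal{A}[\![\mathsf{bool}^{\|\hat\tau_1\| \cup \|\hat\tau_2\|}]\!]$.

7 If $\hat\tau_1 \cong \hat\tau_2$ then $\mathsf{cond} : \mathcal{A}[\![\mathsf{bool}^{\Phi}]\!] \times \mathcal{A}[\![\hat\tau_1]\!] \times \mathcal{A}[\![\hat\tau_2]\!] \to \mathcal{A}[\![(\hat\tau_1 \sqcup \hat\tau_2)^{+\Phi}]\!]$.

8 If $\hat\tau_1 \cong \hat\tau_2$ then $\hat{\cup} : \mathcal{A}[\![\{\hat\tau_1\}^{\Phi_1}]\!] \times \mathcal{A}[\![\{\hat\tau_2\}^{\Phi_2}]\!] \to \mathcal{A}[\![\{\hat\tau_1 \sqcup \hat\tau_2\}^{\Phi_1 \cup \Phi_2}]\!]$.

9 If $\hat\tau_1 \cong \hat\tau_2$ then $\hat{-} : \mathcal{A}[\![\{\hat\tau_1\}^{\Phi_1}]\!] \times \mathcal{A}[\![\{\hat\tau_2\}^{\Phi_2}]\!] \to \mathcal{A}[\![\{\hat\tau_1\}^{\Phi_1 \cup \|\hat\tau_1\| \cup \Phi_2 \cup \|\hat\tau_2\|}]\!]$.

10 $\hat{\bigcup} : \mathcal{A}[\![\{\{\hat\tau\}^{\Phi_2}\}^{\Phi_1}]\!] \to \mathcal{A}[\![\{\hat\tau\}^{\Phi_1 \cup \Phi_2}]\!]$.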

Proof. All of the properties are immediate from the definitions of the operations. □ 

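Theorem 5.2. If $\hat\Gamma \vdash e : \hat\tau$ then $\mathcal{P}[\![e]\!] : \mathcal{A}[\![\hat\Gamma]\!] \to \mathcal{A}[\![\hat\tau]\!]$.

Proof. The proof is by induction on the structure of expressions (and the associated annotated
derivations). As before, many of the cases follow immediately by induction and appeals to
Lemma 5.5. In each case below, $\hat\gamma$ ranges over $\mathcal{A}[\![\hat\Gamma]\!]$.

Case $e = x$:

\[ \frac{x : \hat\tau \in \hat\Gamma}{\hat\Gamma \vdash x : \hat\tau} \]

Note that $\mathcal{P}[\![x]\!]\hat\gamma = \hat\gamma(x) \in \mathcal{A}[\![\hat\tau]\!]$ since $\hat\gamma \in \mathcal{A}[\![\hat\Gamma]\!]$.

Case $e = (\mathsf{let}\ x = e_1\ \mathsf{in}\ e_2)$:

\[ \frac{\hat\Gamma \vdash e_1 : \hat\tau_1 \quad \hat\Gamma, x : \hat\tau_1 \vdash e_2 : \hat\tau_2}{\hat\Gamma \vdash \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 : \hat\tau_2} \]

By induction on the first subderivation, we have $\mathcal{P}[\![e_1]\!]\hat\gamma \in \mathcal{A}[\![\hat\tau_1]\!]$. Hence
$\hat\gamma[x := \mathcal{P}[\![e_1]\!]\hat\gamma] \in \mathcal{A}[\![\hat\Gamma, x : \hat\tau_1]\!]$, so by induction on the second subderivation, we have
$\mathcal{P}[\![e]\!]\hat\gamma = \mathcal{P}[\![e_2]\!](\hat\gamma[x := \mathcal{P}[\![e_1]\!]\hat\gamma]) \in \mathcal{A}[\![\hat\tau_2]\!]$.

Case $e = (e_1, e_2)$:

\[ \frac{\hat\Gamma \vdash e_1 : \hat\tau_1 \quad \hat\Gamma \vdash e_2 : \hat\tau_2}{\hat\Gamma \vdash (e_1, e_2) : (\hat\tau_1 \times \hat\tau_2) \;\&\; \emptyset} \]

By induction, $\mathcal{P}[\![e_i]\!]\hat\gamma \in \mathcal{A}[\![\hat\tau_i]\!]$ for $i \in \{1, 2\}$. Thus
$(\mathcal{P}[\![e_1]\!]\hat\gamma, \mathcal{P}[\![e_2]\!]\hat\gamma)^{\emptyset} \in \mathcal{A}[\![(\hat\tau_1 \times \hat\tau_2)^{\emptyset}]\!]$.

Case $e = \{e'\}$: Similar to the case for pairing.

Case $e = \{e_2 \mid x \in e_1\}$:

\[ \frac{\hat\Gamma \vdash e_1 : \{\hat\tau_1\} \;\&\; \Phi_1 \quad \hat\Gamma, x : \hat\tau_1 \vdash e_2 : \hat\tau_2}{\hat\Gamma \vdash \{e_2 \mid x \in e_1\} : \{\hat\tau_2\} \;\&\; \Phi_1} \]

Let $w^{\Phi} = \mathcal{P}[\![e_1]\!]\hat\gamma$; then by induction $w^{\Phi} \in \mathcal{A}[\![\{\hat\tau_1\}^{\Phi_1}]\!]$ and so $w \in \mathcal{A}[\![\{\hat\tau_1\}]\!]$ and $\Phi \subseteq \Phi_1$.
Hence for each $v \in w$, we have $v \in \mathcal{A}[\![\hat\tau_1]\!]$, so $\hat\gamma[x := v] \in \mathcal{A}[\![\hat\Gamma, x : \hat\tau_1]\!]$. Thus, for each such
$v$, by induction we have $\mathcal{P}[\![e_2]\!](\hat\gamma[x := v]) \in \mathcal{A}[\![\hat\tau_2]\!]$. Moreover,
$\mathcal{P}[\![e]\!]\hat\gamma = \{\mathcal{P}[\![e_2]\!](\hat\gamma[x := v]) \mid v \in w^{\Phi}\} = \{\mathcal{P}[\![e_2]\!](\hat\gamma[x := v]) \mid v \in w\}^{\Phi} \in \mathcal{A}[\![\{\hat\tau_2\}^{\Phi_1}]\!]$ since $\Phi \subseteq \Phi_1$.

□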
6. Discussion



We chose to study provenance via the NRC because it is a clean and system-independent core
calculus similar to other functional programming languages for which dependence analysis is
well understood. We believe our results can be specialized to common database implementations
and physical operators without much difficulty. We have not yet investigated scaling this
approach to large datasets or incorporating it into standard relational databases.

We have, however, implemented a prototype NRC interpreter that performs ordinary type- 
checking and evaluation as well as provenance tracking and analysis. Our prototype currently 
displays the input and output tables using HTML and uses embedded JavaScript code to high- 
light backward slices, that is, the parts of the input on which the selected part of the output may 
depend, according to the analysis. Similarly, the system displays the type information inferred 
for the query and uses the results of static analysis to highlight relevant parts of the input types 
for a selected part of the output type. 

In the worst (albeit unusual) case, a part of the output could be reported as depending on every 
part of the input, as a result of spurious dependencies. For example, this is the case for a query 
such as $(R_1 - R_1) \cup \cdots \cup (R_n - R_n) \cup e$. Of course, this query is equivalent to $e$ and a good query
optimizer will recognize this. However, for non-pathological queries encountered in practice, our 
analysis appears to be reasonably accurate. Even so, typically the structure of the output depends 
on a large set of locations, such as all of the fields in several columns in the input used in a 
selection condition; individual fields in the output usually also depend on a smaller number 
of places from which their values were computed or copied. Thus, implementing provenance 
tracking in a large-scale database may require developing more efficient representations for large 
sets of annotations, especially the common case where a part of the output depends on every 
value in a particular column. 

The model we investigate in this article is similar to that of Buneman et al. [2008b] in many
respects. There are two salient differences. The first difference is that Buneman et al. [2008b]
propagate annotations comprising single (optional) input locations, whereas our approach
propagates annotations consisting of sets of input locations. The second difference is that our
approach provides a strong semantic guarantee formulated in terms of dependence, whereas
the semantics of where-provenance in Buneman et al. [2008b] is an ad hoc syntactic
definition justified by a database-theoretic expressiveness result, not a dependency property.

Of these differences, the second is more significant. Their results characterize the possible 
where-provenance behavior of queries and updates precisely, but tell us little about what might 
happen if the input is changed. Moreover, their expressiveness results have not yet been extended 
to handle features such as primitive operations on data values and aggregation. To illustrate the 
distinction, observe that in the example in Figure 1 the Name fields are copied from the input to
the output (thus, they have where-provenance in Buneman et al.'s model) but the AvgMW fields 
are computed from several sources, not copied (thus, they would have no where-provenance, even 
though they depend on many parts of the input). We believe the approaches are complementary: 
each does something useful that the other does not, and in general users may want both kinds of 
provenance information to be available. 



Buneman et al. [2008b] discuss implementing provenance tracking as a source-to-source
translation from NRC queries to NRC extended with a new base type color. The idea is to translate
ordinary types to types in which each subexpression is paired with an annotation of type color,
and translate provenance-tracking queries to ordinary NRC queries over the annotated types that
explicitly manage the annotations. This implementation approach has the potential advantage
that we can re-use existing query optimization techniques for NRC. A similar query-translation
approach should be possible for dependency provenance, by explicitly annotating each part of
each value with a set of annotations {color} and using NRC set operations to propagate colors.
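To make the idea of values paired with explicit color sets concrete, here is a minimal Python sketch for annotated integers. The `AInt` class and the two functions are hypothetical names introduced for illustration; the propagation rules are an assumption modelled on the shapes of the addition and sum operations in Lemma 5.5 (the result carries the union of the colors of everything it was computed from), not a transcription of the semantics of Section 4.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class AInt:
    """An integer paired with the set of colors (input locations) it depends on."""
    value: int
    colors: FrozenSet[str] = frozenset()


def a_add(x: AInt, y: AInt) -> AInt:
    """Annotated addition: the result depends on both operands."""
    return AInt(x.value + y.value, x.colors | y.colors)


def a_sum(items: Iterable[AInt], set_colors: FrozenSet[str] = frozenset()) -> AInt:
    """Annotated sum: union the colors of the set itself and of every element."""
    total, colors = 0, set(set_colors)
    for item in items:
        total += item.value
        colors |= item.colors
    return AInt(total, frozenset(colors))


# Two B-fields selected by a grouping condition that inspected fields a1 and a2.
group = [AInt(10, frozenset({"b1"})), AInt(20, frozenset({"b2"}))]
total = a_sum(group, set_colors=frozenset({"a1", "a2"}))
print(total.value, sorted(total.colors))
```

In a query-translation implementation the color sets would themselves be NRC sets and the unions would be ordinary NRC set unions; the Python `frozenset` stands in for that here.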

Most database systems only implement SQL, which is less expressive than NRC since it does 
not provide the ability to nest sets as the field values of relations. Nevertheless, it still may 
be possible to support some annotation-propagation operations within ordinary SQL databases. 
Suppose we are interested in a particular application in which annotations are numerical times- 
tamps or quality rankings that can be aggregated (e.g. by taking the minimum or maximum). In 
this case, we can propagate the annotations from the source data to the results according to the 
provenance semantics. Simple SQL queries can easily be translated to equivalent SQL queries
that automatically aggregate annotations in this way, using techniques similar to those used in
the DBNotes system [Bhagwat et al., 2005].

For example, consider the query 

SELECT A, SUM(B) FROM R GROUP BY A 

over relations $R : \{(A : \mathsf{int}, B : \mathsf{int})\}$. Suppose we have relations
$R : \{(A : \mathsf{int}, A_q : \mathsf{int}, B : \mathsf{int}, B_q : \mathsf{int})\}$, in which each field $A, B$ has an accompanying
quality rating $A_q, B_q$. Then we can translate the above SQL query to

SELECT A, MIN(A_q), SUM(B), MIN(B_q) FROM R GROUP BY A

to associate each value in the output with the minimum quality ranking of the contributing
fields; thus, data in the result with a high quality ranking must depend only on high-quality
data. However, performing this translation for general SQL queries appears nontrivial. It is well
known that flat NRC queries (those whose input and output types do not involve nested set types,
that do not involve grouping or aggregation, and that map sets of records to sets of records) can be
translated back to SQL via a normalization process [Wong, 1996], but it is apparently not well
understood how to extract SQL queries from arbitrary NRC expressions involving grouping and
aggregation.
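The annotation-aggregating form of the grouping query can be tried out with Python's standard `sqlite3` module. The sample rows are invented purely for illustration, and the `ORDER BY` clause is added only to make the printed result deterministic; the column names `A_q` and `B_q` follow the annotated relation described above.

```python
import sqlite3

# In-memory database holding the annotated relation R(A, A_q, B, B_q).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, A_q INTEGER, B INTEGER, B_q INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [
        (1, 5, 10, 4),   # hypothetical data: quality ratings range over 1..5
        (1, 3, 20, 5),
        (2, 5, 7, 5),
    ],
)

# Each output value is paired with the minimum quality of the fields it aggregates.
rows = conn.execute(
    "SELECT A, MIN(A_q), SUM(B), MIN(B_q) FROM R GROUP BY A ORDER BY A"
).fetchall()
print(rows)  # [(1, 3, 30, 4), (2, 5, 7, 5)]
conn.close()
```

For the group with `A = 1`, the sum 30 is reported with quality 4 because one contributing `B` field has rating 4, so a high rating on an output value indicates that only high-quality inputs fed into it.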

We can easily implement static provenance tracking for ordinary SQL queries by translating 
them to NRC; this does not require changing the database system in any way, since we do not 
need to execute the queries. Static provenance analysis is slightly more expensive than ordinary 
typechecking, but since the overhead is proportional only to the size of the schema and query, not 
the (usually much larger) data, this overhead is minor. Moreover, static analysis may be useful 
in optimizing provenance tracking, for example by using the results of static analysis to avoid 
tracking annotations that are statically irrelevant to the output. 

Consider for example the following scenario: after running a query, the user identifies an
error in the results and requests a data slice showing the input parts relevant to the error. We can
first provide the results of static provenance analysis and show the user which parts of the input
database contain data that may have contributed to the error. In a typical relational database, this
would narrow things down to the level of database tables and columns, which may be enough
for the user to fix the problem. In case this is not specific enough, however, we can still employ
the static analysis to speed up computing the dynamic provenance. Using the static provenance
information, we know that only locations in the input data that correspond to the static provenance
of the output location of interest can contribute to that output location. Hence, if we are only
interested in the dynamic provenance of a single output location, we might avoid the overhead
of dynamically tagging and tracking provenance for parts of the input that we know cannot
contribute to the output part of interest. We plan to investigate this potential optimization technique
in future work.
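The pruning step in this scenario amounts to a simple filter. The Python sketch below is hypothetical: it assumes input locations are encoded as (table, row, column) triples, that each location is mapped to the column-level color used by the static analysis, and that the static provenance of the output location of interest is available as a set of such colors.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding of an input location: (table name, row index, column name).
Location = Tuple[str, int, str]


def locations_to_track(color_of: Dict[Location, str],
                       static_provenance: Set[str]) -> Set[Location]:
    """Keep only the input locations whose color may reach the selected output."""
    return {loc for loc, color in color_of.items() if color in static_provenance}


# Relation R has columns A, B (colors a, b); relation S has column C (color c).
color_of = {
    ("R", 0, "A"): "a", ("R", 0, "B"): "b",
    ("R", 1, "A"): "a", ("R", 1, "B"): "b",
    ("S", 0, "C"): "c",
}
tracked = locations_to_track(color_of, static_provenance={"a", "b"})
print(sorted(tracked))
```

Locations outside the returned set need no dynamic annotation, since the static analysis over-approximates the dependencies and has already ruled them out.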



6.1. Comparison with slicing, information flow and other dependence analyses



The techniques in Section 4 and Section 5 draw upon standard techniques in static analysis
[Nielson et al., 2005, Palsberg, 2001]. In particular, the idea of instrumenting the semantics of
programs with labels that capture interesting dynamic properties is a well-known technique used in
control-flow analysis and information flow control. Moreover, it appears possible to cast our results
in the abstract interpretation framework [Cousot and Cousot, 1977] that is widely used in static
analysis (see e.g. [Nielson et al., 2005, ch. 4] for an introduction). Doing so would require adapting
abstract interpretation to handle collection types. This does not appear difficult but we preferred
to keep the development in this article elementary in order to remain accessible to nonspecialists.

Dependence tracking and analysis have been shown to be useful in many contexts such as 
program slicing, information-flow security, incremental update of computations, and memoiza- 
tion and caching. A great deal of work has been done on each of these topics, which we cannot 
completely survey here. We focus on contrasting our work with the most closely related work in 
these areas. 

In program slicing [Biswas, 1997, Field and Tip, 1998, Weiser, 1981], the goal is to identify
a (small) set of program points whose execution contributes to the value of an output variable
(or other observable behavior). This is analogous to our approach to provenance, except that
provenance identifies relevant parts of the input database, not the program (i.e. query). Cheney
[2007] discusses the relationship between program slicing and dependency analysis at a high
level, complementing the technical details presented in this article.

In computer security, it is often of interest to specify and enforce information-flow policies
[Sabelfeld and Myers, 2003] that ensure that information marked secret can only be read
by privileged users, and that privileged users cannot leak secret information by writing it to
public locations. These properties are sometimes referred to as secrecy and integrity, respectively.
Both can be enforced using static (e.g. [Myers, 1999, Volpano et al., 1996]) or dynamic (e.g.
[Shroff et al., 2008, Jia et al., 2008]) dependency tracking techniques. Our work is closely
related to ideas in information flow security, but our goal is not to prevent unauthorized disclosure
but instead to explicate the dependencies of the results of a query on its inputs. Nevertheless there
are many interesting possible connections that need to be explored, particularly in relating
provenance to dynamic information flow tracking [Shroff et al., 2008] and integrating provenance
security policies with other access-control, information-flow and audit policies [Swamy et al.,
2008, Jia et al., 2008].



In contrast to most work on static analysis and information flow security, we envision the
instrumented semantics actually being used to provide feedback to users, rather than only as the
basis for proving correctness of a static analysis or preventing security vulnerabilities. This makes
our approach closest to (dynamic) slicing. The novelty of our approach with respect to slicing is
that we handle a purely functional, terminating database query language and focus on calculating
dependencies and slicing information having to do with the (typically large) input data, not the
(typically small) query. On the one hand the absence of side-effects, higher-order functions,
nontermination, or high-level programming constructs such as objects and modules simplifies
some technical matters considerably, and enables more precise information to be tracked; on the
other, the presence of collection types and query language constructs leads to new complications
not handled in prior work on information flow, slicing or static analysis.

In self-adjusting computation [Acar et al., 2008, Acar, 2009], support for efficient incremental
recomputation is provided at a language level. Programs are executed using an instrumented se- 
mantics that records their dynamic dependencies in a trace. Although the first run of the program 
can be more expensive, subsequent changes to the input can be propagated much more efficiently 
using the trace. We are currently investigating further applications of ideas from self-adjusting 
computation to provenance, particularly the use of traces as explanations. 



Dependency tracking is also important in memoization and caching techniques [Abadi et al.,
1996, Acar et al., 2003]. For example, Abadi et al. [1996] study an approach to caching the
results of function calls in a software configuration management system, based on a
label-propagating operational semantics. Acar et al. [2003] develop a language-based approach to
memoizing and caching the results of functional programs. Our work differs from this work in
that we contemplate retaining dependency information as an aid to the end-user of a (database)
system, not just as an internal data structure used for improving performance.

Our approach to provenance tracking based on dependency analysis has been used in the Fable
system [Swamy et al., 2008]. In this work provenance is one of a large class of security policies
that can be implemented using Fable, a dependently-typed language for specifying security
policies. Subsequently, Swamy et al. [2009] have explored a theory of typed coercions that can be
used to implement dependency provenance.

Abadi et al. [1999] argue that techniques such as slicing, information-flow security, and other
program analyses such as binding-time analysis can be given a uniform treatment by translating
to a common Dependency Core Calculus. We believe provenance may also fit into this picture, 
but in this article, we considered both dynamic and static labeling, whereas the Dependency Core 
Calculus only allows for static labels. Another difference is that the Dependency Core Calculus 
is a higher-order, typed lambda-calculus whereas here we have considered the first-order nested 
relational calculus. It would be interesting to develop a common calculus that can handle both 
static and dynamic dependence and both higher-order functions and collections, particularly if 
dynamic information flow and dynamic slicing could also be handled uniformly. 



7. Conclusions 

Provenance information that relates parts of the result of a database query to relevant parts of the 
input is useful for many purposes, including judging the reliability of information based on the 
relevant sources and identifying parts of the database that may be responsible for an error in the 
output of a query. Although a number of techniques based on this intuition have been proposed, 
some are ad hoc while others have proven difficult to extend beyond simple conjunctive queries 
to handle important features of real query languages such as grouping, aggregation, negation and 
built-in operations. 
We have argued that the notion of dependence, familiar from program slicing, information flow
security, and other program analyses, provides a solid semantic foundation for understanding 
provenance for complex database queries. In this article we introduced a semantic characteriza- 
tion of dependency provenance, showed that minimal dependency provenance is not computable, 
and presented approximate tracking and analysis techniques. We have also discussed applications 
of dependency provenance such as computing forward and backward data slices that highlight 
dependencies between selected parts of the input or output. We have implemented a small-scale 
prototype to gain a sense of the usefulness and precision of the technique. 

We believe there are many promising directions for future work, including implementing ef- 
ficient practical techniques for large-scale database systems, identifying more sophisticated and 
useful dependency properties, and studying dependency provenance in other settings such as 
update languages and workflows. 